Hi, I'm Zor. Welcome to another Unizor education lecture. The subject of today's lecture is statistical regression. This lecture is part of the advanced mathematics course presented on the Unizor.com website. It's for teenagers and high school students interested in mathematics as a tool to develop their creativity, logic, analytical thinking, etc. I do recommend you watch this lecture from the Unizor.com website, because the site contains detailed notes for every lecture, so you can go through the whole course as both a textbook and a set of video explanations. All right, so statistical regression. Before going into the mathematics of this subject, let's talk about what statistical regression basically is. Well, it's all about dependency between different things. Let me give you a few examples of how one thing depends on another. If we are able to establish such a dependency, and we can state that it is relatively tight, then knowing one, we can predict or explain the other. That's basically why we are interested in statistical regression. Let me start with something very simple, which represents a very strict, formula-based dependency between two random variables. Consider that we are measuring temperature with two thermometers at the same time and at the same place, but one thermometer measures in units of Celsius and the other in units of Fahrenheit. So we have two different random variables, because at different times and at different places we get different values, right? However, if we take these two thermometers together and measure the temperature with both of them, then we know that there is a formula relating degrees Celsius to degrees Fahrenheit: F = (9/5)·C + 32.
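As a small sketch, this kind of strict, formula-based dependency can be written in a couple of lines of Python; the function name is mine, just for illustration:

```python
def celsius_to_fahrenheit(c):
    """Deterministic dependency between the two thermometers: F = (9/5)*C + 32."""
    return 9.0 / 5.0 * c + 32.0

# Measuring with one thermometer is enough; the other value is simply computed.
print(celsius_to_fahrenheit(0))    # freezing point of water -> 32.0
print(celsius_to_fahrenheit(100))  # boiling point of water -> 212.0
```

There is no randomness left once one value is known; this is the opposite extreme from the statistical relationships discussed next.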
Now, this is a very straightforward, direct dependency of one on the other. Knowing one value, we don't really have to measure with the other thermometer; we can just calculate it. So this is a dependency which is very strict, or strong, whichever word you prefer. However, there are cases when a dependency exists but is not really deterministic. Here is my example. Let's consider two different random variables. One is volcanic activity during a year. Well, I don't know exactly how to measure it, but let's assume we can measure it in some units, maybe the amount of material volcanoes actually put out on the surface, or some other measurement. The other is, let's say, the average temperature during the year for the entire planet. We have a certain number of measurement points, and every day at a certain time we measure the temperature, and then we average over the whole year. So we have a whole year of volcanic activity summarized by one number, and a whole year of temperatures which we average. Now, year after year we make measurements of these two random variables, and we accumulate certain statistics. Are they related? We would like to establish a procedure that would help us analyze this relationship, because if we can do that, we can make certain predictions, maybe for the temperature, based on the volcanic activity we observe. In this particular case there is some relationship, but we don't really know what it is. It's definitely not like the thermometer case; there is no formula relating these things. But we would like to know what kind of relationship, if any, can be established, and if it can, how exactly it would be expressed. Now, a few more examples. This next one is an example of cause and effect. Let me suggest the following: parents are paying for the education of their child.
Now, is there any relationship between the amount of money they spend and the amount of knowledge their child will absorb? Well, maybe. I mean, it's reasonable to assume that a more expensive school would probably provide a better education. But there are other factors which are not really related to money: for instance, how well this particular student can absorb the information given to him, how much time he spends on all this, and many other circumstances. So again, there is some kind of relationship, and in this case it is a cause-and-effect relationship. The question is: is it reasonable for the parents to assume that if they spend more on education next year, they will get better results? Because if they can afford it, then yes, the question does make sense. So we are interested in this relationship for the purpose of making certain decisions which are very important. Another example of this type of relationship is weather forecasting. Does the weather tomorrow depend on the weather today? Let's say we measure the temperature today and tomorrow, and then we repeat this day after day. Now, knowing the temperature today, can I predict the temperature tomorrow with a certain level of precision? Well, again, we can analyze how this relationship between today's and tomorrow's temperature has behaved historically. Maybe it also depends on some additional parameters, like whether it's morning or afternoon, or something else. But in any case, if we are able to establish this type of relationship, it will help us forecast the temperature for the next day. So it makes very practical sense to express this relationship in some way. Okay, my last example is about medical treatment.
Now, there are obviously certain illnesses which are treated in many different ways. You can treat them with drugs, and there are probably many different drugs for the same illness. There are also, for instance, different foods, different lifestyles, more or less active, exercise, etc. They are all involved in the process of treating certain illnesses. Now, which one is more important and which is less important? We have to actually measure this. For instance, suppose we have three different drugs. We collect certain statistics: we administer these drugs to different people, and then we see whether there is a relationship between, let's say, the time it takes a patient to get better using one drug versus another. This type of relationship is very important for, again, very practical reasons: the treatment of people. So I think I have convinced you that the relationship between random variables is very important, and it would be nice if we could measure it in some way. That's what this lecture is about. Now, obviously, whenever we introduce mathematics into real life, real life needs to be simplified as much as possible; otherwise, the mathematics will be horrendous. In this case, I'm going to offer a model, a very simple model of some practical situation, and based on this model, I will talk about statistical regression and, in particular, about linear regression. So what exactly is the model I'm talking about? Let's consider two random variables: a random variable X, which we will call the independent random variable, and a random variable Y, which will be the dependent one. They can be, for example, the temperature now and the temperature an hour from now, or the particular drug we are administering and how well this drug actually works on the patient, or whatever else. What we would like to establish is some kind of linear dependency. That would be great.
I mean, if we could establish a linear dependency between these things, Y = A·X + B, that would be terrific. If we have, for instance, some statistical information, X1, X2, etc., Xn, and Y1, Y2, etc., Yn, and all of the points lie exactly on one line (this is just a line, which is why the whole dependency is called linear regression), that would be terrific. But obviously that's not the practical case. In practical life, there is always some kind of deviation. No matter how we try to draw the line through the points (X1, Y1), (X2, Y2), (X3, Y3), (X4, Y4), etc., we cannot put all the points on that line. There is always something higher or lower than the line, which means the relationship is not exactly of that strict kind. However, if we put the points on a graph, it might look like there is some line which is relatively close to all of them. So our purpose is to find the line which is as close as possible to all of these points at the same time. How can it be done? Well, usually it is done in the following way. Say this is the line; then for each point there is a certain deviation from the line, up or down, right? If we take all these deviations, then, since they are either positive or negative, let's square them and add them together; that sum is a measurement of how close the line is to all these points combined. This is the least squares method, so to speak. If we are able to find the line which minimizes the sum of the squares of all these distances, then we will probably be satisfied.
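The least squares idea above can be sketched in a few lines; the observations and the candidate lines here are invented purely for illustration:

```python
# Hypothetical observations (x_i, y_i), made up for illustration.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

def sum_of_squared_deviations(a, b, xs, ys):
    """Add up the squared vertical deviations of the points from the line y = a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))

# A line close to the points gives a much smaller sum than one far from them.
print(sum_of_squared_deviations(2.0, 0.0, xs, ys))  # tight fit, small sum
print(sum_of_squared_deviations(0.5, 1.0, xs, ys))  # poor fit, large sum
```

Finding the best line then means searching for the (a, b) pair that makes this sum as small as possible, which is exactly what the derivation below does.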
Then we can actually check this sum of squared distances, and if it's relatively small (again, "relatively small" needs some explanation, which will come later), we will accept this line as a good approximate description of the relationship, expressed by this formula, and it can actually be used for some future results. If these are, for instance, results of some drug treatment, then we can say: okay, it looks like the drug works; we can establish this drug as a good one, treat additional people with it, and expect their results to lie somewhere near this line. All right, so let's go back to mathematics. This was all nice philosophy; mathematically, it looks like this. I assume that there are some A and B combined into the formula Y = A·X + B + ε. The values A·X + B are not exactly the values of the random variable Y, but they become Y if we add some random variable ε which characterizes how far my line is from the actual values of Y. So ε is also some kind of random variable, and what I assume is that this random variable characterizes our error, and that it is a normal variable with mean 0 and some variance σ². Now, think about what this variance is. We don't really know the random variables themselves; we know statistics. So instead of X, we have X1, X2, ..., Xn, which we observe or set; instead of Y, we have Y1, Y2, ..., Yn, which we observe; and ε would be the difference between them. If we take these differences squared and sum them over the whole number of experiments, that is essentially the sample variance of ε. So our purpose is to find A and B that minimize this sample variance, which is the sum of (Yi − A·Xi − B)² over all experiments. All right, so we have this particular problem formulated.
We have two random variables, which we assume to have certain distributions, and we assume that the difference between Y and A·X + B is a normally distributed random variable with mean value zero and some variance, and we would like to find A and B in such a way as to minimize that variance. Now let's go to statistics: we don't know the distributions of Y and X; we know the statistics. We have Y1, Y2, etc., Yn, and we have X1, X2, etc., Xn. Okay, so how can we find A and B? We have two unknowns to determine. First of all, let me make the following simplification. Considering that these are two random variables and ε is normally distributed, let me take the mathematical expectation of the left and the right sides. Forget about statistics for a while; we are now in probability theory, assuming we do know the distributions of Y and X, right? Then the mathematical expectation of Y should be equal to A times the mathematical expectation of X, plus B, plus the mathematical expectation of ε, which is equal to zero, right? Because that's what we assumed. So that's very good, actually: we have gotten rid of ε, which we don't know anything about. But now the problem is that we don't really know the mathematical expectations of X and Y; we only know their statistics. So instead, I can assume that this equation holds at least approximately: instead of the mathematical expectation of X, which I don't know, I take the average of the Xi's, which is a good approximation for the mathematical expectation, and the same for Y. And these averages we do know, because we have statistics, historical data already accumulated. Let's call the average of the Y's V and the average of the X's U. So we know these two numbers, right? I can now say that, approximately, V = A·U + B, so B = V − A·U, which means my original equation can be reduced to one with only one unknown.
Since B is expressed in terms of A with two known numbers (U, the average of the X's, and V, the average of the Y's, are just constants known from historical observation), I am actually looking for an equation of the type Y = A·X + (V − A·U) + ε. Adding −V to both sides, I get Y − V = A·(X − U) + ε. Now, why is this simpler? Well, first of all, because I have only one unknown to find instead of two, right? Moreover, Y − V and X − U can be considered new random variables with mathematical expectation zero, since U and V are the empirically obtained averages of my observations of the X's and Y's. So now I have only one parameter to determine, and I have statistics for both of these new variables. For the Y's, the new statistics are Y1 − (Y1 + ... + Yn)/n, Y2 − (Y1 + ... + Yn)/n, etc., up to Yn minus the same average. So instead of my original Y1, Y2, ..., Yn, I take these new statistics, and the advantage is that their average is equal to zero. Same thing for the X's: instead of X1, X2, etc., I take X1 − (X1 + ... + Xn)/n, X2 − (X1 + ... + Xn)/n, etc., up to Xn − (X1 + ... + Xn)/n. These are my new statistics, which I use to find the only remaining parameter A. Now, how can I find it? By minimizing the variance of ε. To keep the notation clear, I will use the same letters with a bar on top; the difference between X2, let's say, and X̄2 is that X̄2 is X2 minus the average of all the X's. So I will now use a completely new equation which looks exactly like the previous one but without the B: Ȳ = A·X̄ + ε.
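The centering step described above, replacing each observation by its deviation from the sample average, might look like this in code (data invented for illustration):

```python
xs = [1.0, 2.0, 3.0, 4.0]

def center(values):
    """Subtract the sample average, so the new values have average zero."""
    avg = sum(values) / len(values)
    return [v - avg for v in values]

xs_bar = center(xs)
print(xs_bar)       # deviations from the average: [-1.5, -0.5, 0.5, 1.5]
print(sum(xs_bar))  # the centered values sum to (essentially) zero
```

Applying the same `center` function to the y's produces the barred statistics used in the rest of the derivation.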
And I will assume that my random variables Ȳ and X̄ have zero as their mathematical expectation. So, using my statistical observations X̄1, ..., X̄n and Ȳ1, ..., Ȳn, and assuming this relationship holds, I have to find A to minimize the variance of ε. Okay, this is actually easy; I have already set everything up. What should we minimize? We should minimize (Ȳ1 − A·X̄1)² + ... + (Ȳn − A·X̄n)², because this is basically the sample variance of ε. Strictly speaking, it's not exactly the sample variance; I would have to divide it by n − 1, but that doesn't really matter, since it's a constant factor. So I have to minimize this expression to find my A, where X̄1, X̄2, ..., Ȳ1, Ȳ2, etc. are known numbers, obtained from the original numbers X1, X2, ..., Y1, Y2, ... by subtracting their averages. So I know all these numbers; A is unknown, and I have to find the A that minimizes the sum. Well, this is actually simple, because what do I have now? I have a quadratic polynomial in A, right? Let's rearrange it so it looks like a real quadratic polynomial. First, the coefficient of A²: it is X̄1² + X̄2² + ... + X̄n², that is, Σ X̄i². Next, the terms linear in A: −2A·X̄1·Ȳ1 − 2A·X̄2·Ȳ2 − ..., which gives the coefficient −2·Σ X̄i·Ȳi, with the sum running from 1 to n. And finally, the free term of this polynomial is Ȳ1² + ... + Ȳn². Okay? And I know all these numbers; again, each X̄i is the corresponding Xi minus the average of all the Xi's, and similarly for the Ȳi's. So how can I determine the A which minimizes this expression? By the way, a minimum does exist, because the coefficient of A² is positive, so the polynomial is an upward parabola.
Now, if you have a polynomial P·A² + Q·A + R, where is the minimum point? It's at A = −Q/(2P), right? I hope you remember that from quadratic polynomials. So in this particular case, the A which minimizes our polynomial equals −Q, which is 2·Σ X̄i·Ȳi, divided by 2P, which is 2·Σ X̄i². Obviously, the 2's cancel, so A = Σ X̄i·Ȳi / Σ X̄i². And that's the answer. We have found the A which minimizes the variance of ε. In graphical terms, we have found the slope of the line which approximates, in the best possible way, all the points (Xi, Yi) positioned on the plane. From this, we can obviously find the value of B. Going back to the original equation, when there was still a B in it, the mathematical expectation of Y equals A times the mathematical expectation of X, plus B, right? So B equals the mathematical expectation of Y minus A times the mathematical expectation of X. A we now know; and again, instead of the expectations of Y and X, we can take their averages, so B = V − A·U, and that's how we find B. Great. We have found A and B. What's next? Well, next we have to find out how good our approximation by this line actually is. The measure of quality is the value of our minimized sum with A set to exactly what we have found. By substituting A into this formula, or into this polynomial (it doesn't really matter which), we can find the value of the variance of ε, which basically characterizes the quality of the approximation. It would be ideal if the variance of ε were equal to zero. Obviously, that never happens, well, except if you're measuring Fahrenheit and Celsius. The closer to zero it is, the better. But it's probably not a good idea to measure it in absolute terms. What does "close to zero" mean? Is one close to zero or not? Relative to what? Relative to a million, yes it is; relative to two, it's not. So we have to compare it to something, treating this minimized sum as an estimate of σ².
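The whole derivation above, A = Σ X̄i·Ȳi / Σ X̄i² and B = V − A·U, fits in a short Python sketch; the sample data is invented just to check that the formulas work:

```python
def linear_regression(xs, ys):
    """Least-squares slope A and intercept B for y ≈ A*x + B."""
    n = len(xs)
    u = sum(xs) / n                      # U: average of the x's
    v = sum(ys) / n                      # V: average of the y's
    xs_bar = [x - u for x in xs]         # centered x's
    ys_bar = [y - v for y in ys]         # centered y's
    # A = sum(x_bar_i * y_bar_i) / sum(x_bar_i ** 2): the minimum of the parabola.
    a = sum(xb * yb for xb, yb in zip(xs_bar, ys_bar)) / sum(xb ** 2 for xb in xs_bar)
    b = v - a * u                        # B = V - A*U
    return a, b

# Points lying exactly on y = 3*x + 1 should recover A = 3, B = 1.
print(linear_regression([0.0, 1.0, 2.0, 3.0], [1.0, 4.0, 7.0, 10.0]))
```

On noisy data the same function returns the slope and intercept of the closest line in the least-squares sense, rather than an exact fit.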
Well, to get the real sample variance, we obviously have to divide the sum by n − 1. Knowing the variance, we can take twice the square root of the variance, that is, two standard deviations, as a 95% certainty interval: ε lies within ±2σ with 95% probability. And now, we should measure this interval relative to the average value of y, which we again know from empirical data. If this two-sigma interval of ε is relatively small compared to the average of y, and in this case I can say what "relatively small" means: for instance, we usually consider that 5% is not such a big deal in many practical problems. The full interval is actually four sigma, because it's two sigma to the left and two sigma to the right. So if this four sigma is relatively small, less than 5%, that is, less than 0.05 of the average of y, then we can consider that we are within a very good margin of error, with y actually lying somewhere around this linear function of x. And that means it's a good result. If, however, this standard deviation is too big relative to y, let's say half of the average of y, then our estimate is really very, very rough. In all practical situations, people really do this type of calculation, so they can say that, with a precision of no more than 3% or 5% or whatever, the values of y lie very near the line which defines this linear dependency. By setting certain values of x as parameters, we can then expect, within this precision, certain values of y. For instance, a certain dosage of some medication allows us to expect the treatment to take, I don't know, some particular length of time, or something like this. Or, if you invest a certain amount of money in your child's education, you will get approximately so much knowledge, with a certain error.
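The quality check just described, comparing the four-sigma residual spread against the average of y, can be sketched like this; the 5% threshold and the data are purely illustrative:

```python
def regression_quality(a, b, xs, ys):
    """Ratio of the four-sigma residual spread (±2 sigma) to the average of y."""
    n = len(xs)
    residuals = [y - (a * x + b) for x, y in zip(xs, ys)]
    # Sample variance of the residuals, dividing by n - 1 as in the lecture.
    variance = sum(r ** 2 for r in residuals) / (n - 1)
    sigma = variance ** 0.5
    avg_y = sum(ys) / n
    return 4 * sigma / abs(avg_y)        # small ratio means a tight fit

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
ratio = regression_quality(1.94, 0.15, xs, ys)  # slope/intercept picked by eye
print(ratio)  # compare this number against a threshold such as 0.05
```

A ratio below the chosen threshold (say 0.05) would mean the line explains y well; a ratio around 0.5 would mean the estimate is very rough, as discussed above.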
All right. Now, not everything in the world is linear, of course; we have made a tremendous simplification. Sometimes there are quadratic functions; what if the dependency is a·x² + b·x + c? Who knows. These are all complications. Again, I wanted to present the case in the simplest fashion, so you get a feeling for what the relationship between two different variables can look like in a simple case. And that's exactly how people try to model certain things; some examples were given at the very beginning of this lecture. So basically, that's it. Try to go through the notes for this lecture on Unizor.com. And if I can come up with some interesting examples and real calculations, which I'm kind of reluctant to do, but maybe I will do some spreadsheet calculations, then I will put them in the notes for this lecture, and you will see exact numbers and how it all really looks in some practical cases, I hope. All right, that's it for today. Thank you very much and good luck.