This is a situation familiar to all of you who have done experiments, gathered data, and tried to infer something about the relationship between inputs and outputs. It is the familiar situation of fitting a curve to observed data to make explicit the relationship between some parameter of interest, which we want to infer as an output, and some input parameter that is under our control, or whose influence we want to see. In any experimental setting, whenever we vary some inputs, measure the appropriate outputs, plot the two against each other, and look at the relationship, this type of model is quite familiar. We will just give it a mathematical structure and a statistical basis for doing the analysis. The term regression itself we will explain later; it is there for historical reasons. To regress in English means to go back, and why that term is used here is an interesting historical consequence of how this model was first proposed; we will see that in the next class, which I think will be the last class. So, let us get started. The introduction is that we would like to explore the relationship between variables based on some observed values, and this relationship will subsequently be used for prediction or for analysis. Regression is often used in forecasting situations where you want to make predictions about the future. In that case the dependent variable is the value you are trying to forecast and the independent variable is time. You see how certain values behave with respect to time, determine some relationship, and as time progresses into the future you extrapolate and make a prediction or forecast of the dependent variable. So, regression models are used in forecasting, for example, or in analyzing the quantitative behaviour between independent variables and dependent variables. The simplest situation is that there is a response variable, which we will call y, and an input variable, which we will call x. We look at a dependent variable y with respect to the values of an independent variable x, and we plot y versus x, which is the natural way of visualizing it. The setting is that we are able to observe n paired values: a pair of values (x, y) is observed in n instances. Here is an example. A group of scientists and engineers is trying to improve emission quality in vehicles, and they propose that some additive in gasoline will improve the emission quality. The task is to explore the relationship between the amount of the additive and the reduction in the emission. Of course, these things will not have an exact relationship, because there are many confounding factors: other ingredients, environmental conditions, and so on. But the main other factors that we know about we will try to keep fixed during the study, while we vary the amount of this additive and see what happens to the emission. We are able to conduct a few trials for this study in order to establish whether the introduction of this additive into gasoline improves the emission quality. So, how do we plan and conduct such an experiment?
We have to try to see that the measurements are made on the same vehicle, or the same type of vehicle, and over a period of time, so that we get reliable numbers. Obviously we cannot just pick up observations randomly; we try to control the conditions of the experiment. So, let us say we select several new automobiles, so that the wear and tear is the same and they are as new as possible, and of the same brand, so that they have the same performance characteristics and are comparable. We select those as the experimental units and we measure the extent of nitrogen oxide in the exhaust of the car, which is one of the things leading to pollution. We measure first without the additive and then with a specified amount x of the additive, and we do observe some reduction. What we are interested in doing is quantifying the extent of the reduction with respect to the amount of the additive in the fuel mix. Let us say we have the money, time and resources to conduct, say, 10 experiments in different settings, and we have managed to conduct these 10 experiments. At the bottom you see the table: for different amounts of the additive x we are able to collect the response variable y. What you observe is that there is a certain range of values of x, in this case 1 to 7, and some of the values are repeated: 1 is repeated, 4 is repeated, 6 is repeated in the input. And you see that for the same value of the input you get different outputs: for the amount of additive x equal to 1 you get 2.1 or 2.5. This is because the output is not completely determined by the input; it also has some random variation, and the extent of that randomness shows up in the output values that we see. For example, even for the value 4 we see values of y equal to 3.8 and 3.2, and if we took further readings we might get another spread of values. But broadly speaking you can see that there is an increasing trend: as the value of x increases, the value of y also increases. On the other hand, the value of y for x equal to 6 in one of the readings is 3.9, while the value of y for x equal to 5 is 4.3, so it is not guaranteed that if you increase x then y will increase, because there is an inherent randomness in the output. We are able to see that there is a broad increase, and we want to quantify that base-level phenomenon of increase with respect to the input variable; from the data itself we recognize that there is an inherent randomness. In fact, one of the first questions we ask in regression models, once we are able to put down the specifics of the model, is: is there really a dependence? Once we recognize that there is some randomness in the output with respect to the input, is there really a dependence, or is the whole thing just some base value superimposed with some randomness? Is there really a linear dependence? The scientists in this case are hoping to establish that as the amount of additive increases, the emissions come down.
So, they claim that as x increases, y increases. Of course, it is then of interest to see how much the increase is, at what rate it increases, but the first thing is to establish whether there is an increasing dependence at all, that is, whether y increases as x increases. We will come back to looking at this data a little more closely, but let us first get some terminology. Different terms have been used over the years. The value x in the previous example, which is the amount of additive, is the control variable or controlled variable, also called the independent variable or the predictor variable, and we denote it by x. The effect, or the response variable, or the dependent variable, is denoted y. In simplistic terms, y is a function of x, and we are trying to determine what that function f is. One important thing in this regression study is that there need not be a causal implication; it is not necessarily the case that y is physically a result of x. In this example the scientists would be claiming that the reduction is because of the additive, that there is actually a cause-and-effect relationship, but that need not be so, especially when we are using the model for prediction. In applications where we use regression models for prediction this need not be so. I may develop a regression model linking some health indicator with some other health indicator, say the incidence of a certain type of disease with the incidence of some other type of disease over the years, and I may find that there is a strong relationship between the two. I may have independent estimates of how one of these incidences is going up, and from that I may want to infer something about how the other is going. Now, in this case it is a bit far-fetched to imagine that one of them is causing the other; there could be a common underlying cause, like an unhealthy lifestyle, which is causing both of these things. So it is not that one is causing the other; there could be an underlying variable z which is causing both x and y in a causal way. I may still be able to build a relationship between y and x, and if I have independent ways of estimating how x is likely to grow over time, I may want to use that to infer how y is likely to grow over time. If I have a relationship between y and x, then using estimates of x I can get estimates of y. This is an example where there is no causal relationship between y and x; they are both related to some other underlying cause, but I may still be able to build a regression model and use it. So, there need not be a causal relationship, but it is more satisfactory to use this model when there is some sort of relationship between y and x. Supposing I claim that there is a relationship between the number of deaths attributed to smoking in India and the number of new homes built in Sydney, Australia, every year.
Now, this could be true: maybe more homes are being built in Sydney every year, and more people are dying because of smoking in India every year, so both of these may be growing, and I may discover that there is a wonderful relationship between the two. Because I have estimates of new homes being built in Sydney, I could infer something about deaths due to smoking in India. But that is a bit far-fetched: it may be mathematically true, I may find a very nice correlation between the two, but it is not very satisfactory to explain and use. I am only saying that the model itself carries no implication of a causal relationship; to interpret and use the model it would be nice to have a causal relationship, either a direct one or an indirect one through a common cause. Just to emphasize: regression does not mean that there is a cause and effect. It is often set up in that manner, but that is not necessary. So, what we are studying is, in a sense, a more sophisticated form of curve fitting, with careful assumptions and analysis. Returning to this example, the first step in all these data-analysis exercises is to plot the data, look at it, and visualize it. This is the scatter diagram: in an experimental setting we plot the independent variable x on the x axis and the dependent variable y on the y axis and look at what the plot looks like. We see this set of values plotted here; there is a certain range of values of x, 1 to 7, and some range of values of y, and anyone looking at this would say that there seems to be an increasing relationship between the two. In other words, if we quantify this in the form y = mx + c, then that m is positive; that is what we would infer. But we can clearly see that there is some variability: there is no one straight line that will fit all these points. It is not a perfectly explained numerical relationship, but it seems to be increasing. One of the underlying assumptions of linear regression is that the dependence has a particular form which is hypothesized up front. For example, suppose we assume that the relationship is of the form y equals alpha plus beta x plus some error term. So the response variable y_i in a given experimental observation is related to x_i, the corresponding value of the input or control variable, by the linear equation y_i = alpha + beta x_i + e_i, where e_i is an error term; for different values of x_i we have observed values of y_i. So x_1 to x_n are the set values of the controlled variable which the experimenter has selected for the study, and the y_i are the corresponding output values which the experimenter has measured. Now, we may or may not have a full range of values of x_i, because, for example, we may have gathered this data from trials conducted in various places and subsequently just picked up that data. In some cases we may have been able to design the trial and commission the study and the measurements; in other cases we are only able to collect the data as available.
For example, we may want to hypothesize a relationship between some economic activity, like GDP per capita or some economic output measure per person, as a function of some health indicator of the population. Suppose some economist claims that the GDP per capita is a certain function of the health index of the population, measured in some way, say through infant mortality or any other health indicator, for instance the number of babies who live beyond a certain age. If that is good, it means the general health conditions are good, the general economic conditions are good, there is good governance, and so that would indicate that economic activity is in good shape. They might want to establish a relationship between these two things, for whatever purpose, for planning or for justifying something. Now, you cannot play around with that x index: I cannot commission a study in some country where the health index is forced to some value; I cannot play around with human lives like that. But I may be able to observe: I go to various countries, I commission studies, and I find out that in this place the health indicator was x_i and the economic indicator was y_i, and I make a note of it. So I do not have control of the experiment; I just get whatever data I can. Because conditions in the world are different, I may not have equally distributed values of x_i over the entire range: there may be a cluster of values in some countries because of their conditions, and some other cluster of values of x_i in some other range. So I may have, say, 20 observations of x_i and corresponding values of y_i, and those 20 values of x_i may not be uniformly spread over the range, because I really have no control over it. Although I call it the controlled variable, it is just something that I observe as the input variable, and correspondingly I measure the output variable, which is the response variable. I may not have a choice in how the input values are spread over the range, and even repeated values are fine; in fact, repeated values emphasize the fact that there is an inherent variability in the output. In the example that we saw, we got a range of values between 1 and 7, and there are repeated values at 1, 4 and 6; that is fine. Now, one of the key things in these regression models is the nature of the randomness which is inherent in the model. The values e_i in this blue equation are the unknown error components; we do not know what those error components are. In fact, let us relate what we are doing with this data to the blue equation: what is known and what is unknown? In ordinary mathematics we use x and y to denote variables which are to be solved for or determined, and constants like a, b, c, alpha, beta as parameters which are known. But for historical reasons, here it is the other way around: the values x_i and y_i, for i equal to 1 to n, are known; they are what we observe. We know x_1 and the corresponding y_1, so we know the pair (x_1, y_1), and so on up to (x_n, y_n). What we do not know are the values alpha, beta, and the e_i; those we do not know.
So, there is some relationship between x_i and y_i which is captured by alpha + beta x_i. But if y_i were exactly equal to alpha + beta x_i, then y = alpha + beta x would be a perfect linear relationship, and we can see from the data that this is unlikely to be the case: there is going to be some error term, which is e_i. That also we do not know. We do not know which part of the observed data is the error and which part is due to alpha and beta, because we have not yet determined what alpha and beta are. In this whole exercise, e_1 to e_n are unknown error components which are superimposed on the true linear relationship. We do not know the true linear relationship, but once we hypothesize it, there is some error term superimposed on it. These are unobservable random variables; we simply assume that, with respect to the true linear relationship, they are normally distributed error terms with mean 0 and some variance. It is a leap of faith: we look at the data and we say this looks like a linear relationship on which some error terms are superimposed. We will see a drawing of this, but that is the starting point for the linear regression model. The parameters alpha and beta are unknown, and we assume that the error terms are normally distributed with mean 0. Why mean 0? Because with mean 0 the underlying model is neither an overestimate nor an underestimate of the true relationship; otherwise, if the errors were not distributed with mean 0, we could always shift the model a little bit to make the mean 0. But the variance sigma squared is going to be there; some variance is there. So the parameters alpha, beta and the variance parameter sigma squared are unknown, and determining these unknown parameters in some logical way is the first task in linear regression models. This is the way we can visualize the model: for different values, say x_i and x_j, we observe y. What we assume is that for a given x_i there is an underlying model, which is the nice dotted straight line, but because of some randomness inherent in the phenomenon, y_i does not take exactly the value on the line; it takes some value with some distribution, let us say normally distributed with mean equal to the value on the line and some variance. So it can take a value higher or lower than this, and we plot a sort of normal distribution with variance sigma squared; in one reading we might get this value, and some other time when we take the reading we could get that value. Broadly speaking it is an increasing relationship, but superimposed with some variance. That is what this curve is trying to show: there is a linear relationship between y and x, encapsulated by the equation y = alpha + beta x, where alpha is the y intercept and beta is the slope, superimposed with random errors e_i. The observed value y_i is equal to alpha + beta x_i + e_i, a random error term. The only thing is that we do not know alpha, beta, or the error term; that is the task. So, we have these normally distributed errors.
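Just to collect these assumptions in one place, the model we have been describing in words can be written in the standard notation (this is the usual way of writing it, not copied from the slide):

\[
y_i = \alpha + \beta x_i + e_i, \qquad e_i \sim N(0, \sigma^2) \text{ independently}, \qquad i = 1, \dots, n,
\]

where the pairs (x_i, y_i) are observed and alpha, beta, sigma squared are the unknown parameters to be determined.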
Actually, there are a lot of assumptions in this model, as you can see; in particular the assumption of normally distributed errors, and the assumption that the variance remains the same as you vary x_i, which is a fairly strong one. Normally distributed is reasonable, because what we observe of y_i is some relationship with x_i plus a lot of other things, and we assume those other things are many small effects which add up to the error term; in a given situation some could push the value up and some could push it down, and otherwise we can always recalibrate to get mean 0. So something like the central limit theorem can justify normally distributed errors, with mean 0 by appropriate calibration. But the assumption that sigma squared is the same as x_i varies is strong. In particular, if x is time, then the claim that the variance stays the same as time progresses is a fairly strong assumption. In the branch of applied statistics called econometrics, where statisticians, mathematicians and economists get together and construct time-series models of data for planning, forecasting and prediction purposes, this is a big assumption in linear regression models, that the variance is the same as x_i varies. I believe that is called homoscedasticity, a heavy-duty term to indicate that the variance is the same as you vary the x value. So, this is the depiction of the data: if you go back to it, there is an increasing trend in the y versus x relationship, but there is some error term which perturbs it a little. The god of the process, or nature, is generating this in the following way: given an x_i, the god of the process knows alpha and beta, computes alpha + beta x_i, then picks a random number from some table, a normal random variable with mean 0 and variance sigma squared, adds it to alpha + beta x_i, and gives it to you; that is what you observe. Think of it as a game where you give the oracle, the all-knowing generator of the process, the value x_i; that person, with secret knowledge of alpha and beta, computes alpha + beta x_i, but just to confuse you adds some random term e_i with mean 0 and variance sigma squared, and gives you the answer y_i, which is the sum of the two, alpha + beta x_i + e_i. So what you see is y_i: you give x_i, you get back y_i, that is what you are able to observe, and you plot it. Your job is now to determine alpha and beta, and an estimate of sigma squared, which explains this process as well as possible. Finally, you would use this model in a cautious way: if x_i is so much, then what is y_i? You would make a prediction, and from this data you can see that the prediction is subject to some error, because even based on these observations, if x_i is equal to 4 you cannot say exactly what y_i is going to be; it could be in some range, but you are able to give a mean value for it and also an estimate of the likely error. That is the task of linear regression: to determine alpha and beta and also to estimate the error variance sigma squared. So, this is the schematic representation of the relationship between y and x.
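As a small illustration of this oracle game, here is a minimal Python sketch of the assumed data-generating process; the particular values of alpha, beta and sigma, and the chosen x values, are made up for illustration and are not the lecture's numbers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Secret parameters known only to the "oracle" (hypothetical values).
alpha, beta, sigma = 2.0, 0.4, 0.5

def oracle(x):
    """Return one observation y = alpha + beta*x + e, with e ~ N(0, sigma^2)."""
    e = rng.normal(loc=0.0, scale=sigma)
    return alpha + beta * x + e

# The experimenter chooses the inputs (repeats are allowed) and records the outputs.
x_values = [1, 1, 2, 3, 4, 4, 5, 6, 6, 7]
observations = [(x, oracle(x)) for x in x_values]
for x, y in observations:
    print(f"x = {x}, observed y = {y:.2f}")
```

The experimenter sees only the (x, y) pairs printed at the end, never alpha, beta or the individual error terms, which is exactly the situation the regression procedure has to work with.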
In terms of the model, alpha is the y intercept and beta is the slope, both of which have to be determined. You would have used this as a principle in your experimental work and curve fitting. The criterion for determining the regression line, one of the criteria used, is the principle of least squares: we determine the unknown parameters alpha and beta so as to minimize the overall discrepancy between the observed values and the predicted values. This is a well-known principle which you can easily relate to. Suppose these are the values that I see: I vary x_i, measure y_i, and plot them. I can fit a line like this, I can fit another line like this, I can fit various lines. For each of these lines there will be some error with respect to the observations: some observations are below the line and some are above the line. In fact, some observations had better be above the line and some below it; if all the observations were above the line you draw, then the line should be shifted upwards, and vice versa. Ultimately you will propose a line which is a sort of compromise: some of the values are overestimated and some are underestimated. So what is the criterion for deciding this line? You want to penalize overestimation and underestimation, and you want to do this in a combined way. One way of doing it is to look at the error term, which is y_i minus (a + b x_i), where a and b are estimators of alpha and beta. I propose the line y = a + b x, which is supposed to be an estimate of the real line y = alpha + beta x; a is an estimator for alpha and b is an estimator for beta. So, supposing my line is y = a + b x, then I take the difference between the observed value y_i and the predicted value a + b x_i as my measure of fit, and because I want to penalize deviations both above and below, one way is to square it and then minimize the total deviation. I could do other things; I could take the absolute value of the difference, but squaring is one option. Another point about the square is that you want to penalize large deviations from the fitted line a little more, to give more weight to large deviations, so you take the squared differences. The square has the advantage that both positive and negative deviations are penalized; of course, it also means that the penalty grows faster and faster as the deviation grows. You could also think of taking the cumulative absolute error, which also penalizes both directions. One thing is clear: just summing y_i minus a minus b x_i is not a good idea, because then positive and negative errors will cancel out and you might end up with a very bad line. So I do want to penalize the errors in both directions, and one way of doing it is to take the squared error. Now, this has some logical basis and it also has very nice mathematical properties: in particular, the error function is differentiable and has smoothness properties which the sum of absolute values does not have.
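Written out, the least squares criterion described here is to choose a and b to minimize the sum of squared errors, which in the usual notation is

\[
\mathrm{SSE}(a, b) = \sum_{i=1}^{n} \left(y_i - a - b\,x_i\right)^2 .
\]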
So, for a combination of practical and mathematical reasons, this principle of least squared error is used, at least for first-cut regression models. The quantities a and b are unknowns in this expression; they are chosen to minimize this error term as well as possible, and they are then called the least squares estimators of alpha and beta. We qualify them in that way because they are estimators of alpha and beta obtained from the least squares principle, the squared-error minimization principle; there could be other estimators of alpha and beta. So, consider this least squares estimation: think of the right hand side as a function of a and b. Remember that the y_i and x_i are known, a and b are just two constants to be found, and there are n pairs of x_i and y_i, so the summation takes all of them into account. All the y_i are known, the x_i are known, and a and b are not known. If you treat the right hand side as a function of a and b, I would like to find a and b so as to minimize this function. In case you have a perfect linear relationship, all these terms would be 0; the lowest possible value of the right hand side is 0, because it is a sum of squares and has to be non-negative. So in the case of a perfect linear relationship I may be able to find a and b which make the right hand side exactly 0, but that is unlikely, and I then settle for whatever makes this value minimum. Now, the right hand side is a nice quadratic function of the two variables a and b. What is the minimization principle? I take partial derivatives with respect to a and b and set them equal to 0. This you would know from your maths courses or your other background; I hope everyone knows that if I have a function of two variables which is smooth, twice continuously differentiable, then the necessary condition for a minimum is that the vector of partial derivatives is 0, or equivalently that each of the partial derivatives is 0. In this case you get two equations, one for each partial derivative, and each of those equations is a function of a and b. So you have two equations in a and b, and hopefully you can solve them and get a solution. In this case you can actually solve them and get a unique solution, and it will turn out to be a minimum, not a maximum or a saddle point or a point of inflection. These are all nice properties of the least squares principle. So, just to reinforce: the proposed value is a + b x_i and what is observed is y_i, so the deviation is y_i - a - b x_i, and the cumulative of these deviations, for some choice of a and b, is to be minimized. The right hand side, the sum of the squares of y_i - a - b x_i, is called the sum of squared errors. And just to repeat, y_i and x_i are known, while a and b are unknowns to be determined. So we have the sum of squared error terms, or sum of squares, and the optimal values are obtained by setting the partial derivatives with respect to a and b equal to 0.
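Carrying out those two differentiations and setting them to zero gives two linear equations in a and b; since the slide itself is not reproduced here, they are written out below in their standard form:

\[
\frac{\partial \mathrm{SSE}}{\partial a} = 0 \;\Rightarrow\; n\,a + b\sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i,
\qquad
\frac{\partial \mathrm{SSE}}{\partial b} = 0 \;\Rightarrow\; a\sum_{i=1}^{n} x_i + b\sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i .
\]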
And the result is usually unique (in some very degenerate cases something else can happen, but usually it is unique), and it is definitely a minimum, not a saddle point or a maximum. You get two equations in two variables, which are called the normal equations. This is a matter of simple algebra: you take the previous expression for SSE, take the partial derivative with respect to a and set it equal to 0, take the partial derivative with respect to b and set it equal to 0, and you get these two equations. You can verify this by doing it yourself; it is simple. For those who are interested in the mathematics, there are some nice properties here. First of all, how will you check whether the solution of these equations is unique or non-unique? I have two equations in two variables; I think everyone knows how to solve systems of linear equations, and two equations in two variables is about the simplest case you can get. You would look at the rank of the matrix, or the determinant, or whatever test you like for uniqueness of the resulting solution, and you can verify that there is a well-defined, numerically stable solution. How will you ensure that it is a minimum and not a maximum, that the values of a and b that you get actually minimize these errors rather than maximize them? One thing to note is that there is really no question of maximizing the errors, because I can choose a and b to be either very large or very small and this square term will blow up; obviously I can choose a and b to make this error term as large as I want. In fact I can fix a and choose b to make the term as large as I like, and similarly I can fix b and choose a to make it as large as I like. So there is no maximum; it is a function which is unbounded above. For a fixed b, a can be made very large and this term will be very large; for a fixed b, a can be made very small and the term will again be very large. So for a fixed b this term is a U-shaped function which goes off to infinity in both directions, and similarly for a fixed a. In other words, just by looking at it, since it is a sum of squares, I can imagine that it is going to be a bowl-shaped function of some type. There is no maximum; the function goes off to infinity, so the chances are that what we get is a minimum. If there is any extremum of this function at all, it is going to be a minimum. But what is the mathematical test for ensuring that it is a minimum? Let me repeat the question: I have a function of two variables, I do something, I find a set of values, and I claim that at these values the derivative is 0. As you know, a zero derivative could correspond to a minimum or a maximum. So what do I do to ensure that it is a minimum and not a maximum? Second derivatives. It is a function of two variables, so there are two second derivatives, but there is a little complication: there are also the cross second derivatives.
So, the criterion for optimization of such functions involves the matrix of second derivatives, with a condition on the leading principal minors of that matrix. You can apply that criterion here and see that the solution is actually a minimum, not a maximum and not a saddle point either. A little bit of background reading will convince you that this set of equations, which you get from the minimization principle of setting the partial derivatives equal to 0, leads to a solution which is well defined and which also satisfies the minimum condition from the second derivative criterion. In the case of multiple variables you have to apply the second derivative criterion to the appropriate matrix of second derivatives, which is the Hessian matrix; that matrix has to be positive semidefinite. You would know this, yes or no? Well, I hope many of you know it; if not, refer to your mathematics textbook, and you can easily check that this is true, and that the solution of this set of normal equations gives values of a and b which satisfy this criterion, so you get a minimum and not a maximum. Actually, for simple linear regression in one variable you can solve this explicitly, and this is what you get. Define x bar as the mean value of the x's: I have all these values, there is some mean value of the x's, which is x bar, and some mean value of the y's, which is y bar. With respect to these I can compute b, and once the slope is known the intercept is easily found: a is equal to y bar minus b times x bar, since the mean values should satisfy the line, that is, y bar should equal a plus b times x bar. So anyway, I have to solve those two equations, it turns out an explicit solution is possible, and I get these values as the unique solution for the observed data. In these expressions all the x's are known (they are the input values), all the y's are known (they are the output values), and x bar and y bar are the mean values of the input and output values respectively; that gives an explicit expression for b, and then a is computed. So basically I have to solve the two equations simultaneously for the two variables. For the given data, using all the x values including the repeated ones, x bar comes out to be 3.9 and y bar is 3.51, and computing the cross terms, b comes out to be about 0.387 and a comes out to be about 2, so the regression line is roughly y = 2 + 0.387 x. What is the conclusion of this exercise? Is the experimenter justified in claiming that increasing the additive will reduce pollution? Well, the answer seems to be yes, because the slope is positive: if I increase x, y will increase, and y is the reduction in the emission, so if I increase x I will get some reduction.
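The explicit solution being referred to is the standard one, b = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)² and a = ȳ − b x̄. Here is a minimal Python sketch of that computation; the data array below is hypothetical (only a few of the lecture's ten readings were quoted, so the rest are filled in for illustration), so its output will not exactly match the 0.387 and 2 quoted above:

```python
import numpy as np

def least_squares_fit(x, y):
    """Return (a, b) minimizing sum((y - a - b*x)**2) for simple linear regression."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    a = y_bar - b * x_bar
    return a, b

# Hypothetical data in the spirit of the additive example (not the lecture's table).
x = [1, 1, 2, 3, 4, 4, 5, 6, 6, 7]
y = [2.1, 2.5, 2.9, 3.0, 3.8, 3.2, 4.3, 3.9, 4.4, 4.8]
a, b = least_squares_fit(x, y)
print(f"fitted line: y = {a:.3f} + {b:.3f} x")
```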
Now, in what range of values is this conclusion valid? Suppose I claim that, therefore, if I set x to 20 I will get a correspondingly large reduction; that may not really be valid, because the range of values we have used for calibrating this line is only from 1 to 7, and there is already some error in it. So I cannot extrapolate to values like x equal to 20, and although the equation is mathematically valid even for negative values of x, that may not make sense. I would be careful in interpreting this equation, but it may give me some quantitative assessment of what happens to y as x varies; in particular I can use it to interpolate, and maybe I can use it to get a quantitative assessment which I may like to use. Now, as we will see, for a given x_i the value y_i is actually a random variable because of the error term: for a given alpha and beta, y_i is a random variable for a given x_i. So if I look at the estimators, in a given experiment they are actually random variables whose expectation is the quantity we are trying to estimate. It turns out that a and b, because of this randomness in the phenomenon, involve the error terms, and so they are actually random variables, but their expectation turns out to be equal to the quantities we are trying to estimate, which is a good thing; but that has to be shown. It can be shown, and they therefore turn out to be unbiased estimators of the quantities we are trying to estimate. Now, they will have some variance, and it turns out that the variance of a is sigma squared times (1/n plus a term involving x bar and the spread of the x values). You can see that the variance of a as an estimator depends on sigma squared, which is the inherent variance in the process; it reduces as n increases, which is again what you would expect, that with a lot of data I can get a good estimate of the line; and it also depends on the variability of the x values. Similarly for b: if the x values I have are widely scattered I get a certain variance, and it also depends on the inherent variance of the process. So these estimators a and b that we have constructed from the least squares principle, if I view them as outcomes of a random process, meaning that for given alpha and beta, y_i equals alpha plus beta x_i plus e_i, where e_i is a random variable, then a and b are also random variables, and they have some variance. They tell me how much confidence I can have in the estimates I have produced of alpha and beta. Furthermore, the sum of squares of residuals is itself a random variable, and you can show that it is related to the inherent variability sigma squared: it turns out that the sum of squared residuals divided by n minus 2 is an unbiased estimator of sigma squared. So the inherent variability of the process that generates y_i for a given x_i is estimated by this quantity. In summary, using the least squares principle we can actually get estimates of alpha, beta and sigma squared, where sigma squared is the variance of the error term; we can get unbiased estimators of alpha, beta and sigma squared using the data that we have seen.
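The standard expressions for these quantities in simple linear regression (written here from the usual textbook formulas, since the exact terms on the slide are not reproduced) are

\[
E[a] = \alpha, \qquad E[b] = \beta, \qquad
\mathrm{Var}(b) = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad
\mathrm{Var}(a) = \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right), \qquad
E\!\left[\frac{\mathrm{SSR}}{n-2}\right] = \sigma^2 .
\]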
So, the procedure we have described is: set up the sum of squared residuals, minimize it by setting the partial derivatives equal to 0, which gives you the normal equations, and solve those normal equations in the two variables. The estimates that you get then turn out to be unbiased estimators of the intercept term alpha, the slope term beta, and, through the residuals, the error variance sigma squared. That is the least squares procedure: this b, then based on it this a, and then the sum of squared residuals divided by n minus 2; together they give you the estimates of the underlying process. One thing to note is that you can explicitly compute a and b for a given data set, and because of the randomness, a and b are random variables, and it is the expectation and variance of these random variables that we are talking about. Any questions? Just to conclude this part of the lecture: if the data is highly variable, look at this case. After fitting the best-fit line, whatever we have is not exact, but it is pretty good, and there will be a confidence level for b, which is the slope. Whereas if I do the same thing with more scattered data, it may be the very same line, but that line has qualitatively different behaviour from the other one in terms of my confidence in the prediction. How is that captured? If we believe that the underlying process is alpha plus beta x_i plus an error term, then for a given x_i the outcome y_i is a random variable. Therefore, the expressions for a and b which we derived using the least squares principle are random variables: for a given set of experiments where the x_i are chosen and y_i is observed, if we believe each y_i is an outcome of alpha plus beta x_i plus an error term, then each y_i is a random variable given x_i. So the estimates a and b, for which we have formulas, are random variables, and in fact they turn out to be sums of those error terms; you can show that they are normally distributed random variables. They have some expectation and some variance, and it turns out that, the way we have constructed them, the expectation of a is exactly alpha, the quantity we are trying to estimate, and the expectation of b is exactly beta, which is how it should be; and the variances are also available. So we can use all this to construct confidence intervals for alpha and beta, and of course for prediction as well we can get confidence intervals. Let us see how far we can go in the next class, but if you have any questions now, please ask. Regarding the question about the variance and e: the e_i is assumed to be a normally distributed random variable with variance sigma squared, and the sum of squared residual errors divided by n minus 2, where n is the number of readings, is an unbiased estimate of sigma squared. The sum of squared residual errors here is that summation, the objective function that we constructed, evaluated at the fitted line; that quantity divided by n minus 2 is an unbiased estimator of sigma squared. Roughly speaking, it is the same principle as always: this is a sum of so many terms, so you cannot just take the sum of the errors and use that as the estimate of the error variance; you have to divide it by something.
So, you have to divide by something, but we are using the same data to estimate two parameters, a and b, so you have to divide by n minus 2 and not by n; that turns out to be the explanation for it. In any case, something like this is the estimator for the inherent error in the process, that is, sigma squared.
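As a small follow-up to the earlier sketch, here is a self-contained Python illustration of this unbiased estimate of sigma squared; as before, the data is hypothetical rather than the lecture's table, and the function name is my own:

```python
import numpy as np

def estimate_sigma_squared(x, y):
    """Fit y = a + b*x by least squares and return SSR/(n - 2),
    the unbiased estimate of the error variance sigma^2."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a = y.mean() - b * x.mean()
    residuals = y - (a + b * x)          # observed minus fitted values
    return np.sum(residuals ** 2) / (len(x) - 2)

# Same hypothetical data as in the earlier fitting sketch.
x = [1, 1, 2, 3, 4, 4, 5, 6, 6, 7]
y = [2.1, 2.5, 2.9, 3.0, 3.8, 3.2, 4.3, 3.9, 4.4, 4.8]
print(f"estimated sigma^2: {estimate_sigma_squared(x, y):.3f}")
```

The division is by n minus 2 rather than n because two parameters, a and b, have already been estimated from the same data, which is exactly the point made above.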