So, we wrote down n equations, and the moment we have n equations we might think we can solve them and get n estimates of the parameters. But even though we perform n experiments, the number of regression coefficients we estimate is smaller than n. Essentially that means we are not solving those n equations in n unknowns. It also makes no sense to have a huge number of parameters in our regression model, as many as the number of experimental observations; we need only a small set of parameters. So we do not solve those equations simultaneously; we adopt another method. You can solve n equations in n unknowns using matrix methods, but we are going to use a different method, also involving matrices, to get good estimates of the parameters beta 0 hat, beta 1 hat, beta 2 hat, and so on to beta k hat. In total we have p parameters to estimate, with p equal to k plus 1: the k regression coefficients associated with the k regressor variables, plus the intercept beta 0 hat. All this should be very clear. The method we are going to adopt is least squares estimation for the parameter set given by the beta column vector. What we do is define a function L equal to the sum over i from 1 to n of epsilon i squared. In matrix notation, this sum may be written as epsilon prime epsilon. This is very simple; let me explain. Epsilon is the column vector epsilon 1, epsilon 2, and so on to epsilon n, with n rows and one column. Its transpose, epsilon prime, is the row vector epsilon 1, epsilon 2, and so on to epsilon n, with one row and n columns. You do not want epsilon times epsilon prime, because that is n by 1 times 1 by n, which gives an n by n matrix.
Its first entry would be epsilon 1 squared, the next epsilon 1 times epsilon 2, and so on; this is not what we want. We want epsilon prime epsilon, which is 1 by n times n by 1; the inner n's cancel and you get a 1 by 1 scalar. What is epsilon prime epsilon? You have the row epsilon 1, epsilon 2, and so on to epsilon n multiplying the column epsilon 1, epsilon 2, and so on to epsilon n, giving epsilon 1 squared plus epsilon 2 squared plus so on up to epsilon n squared. Epsilon 1 is multiplied with epsilon 1, epsilon 2 with epsilon 2, and so on to epsilon n with epsilon n, so we get the sum of epsilon i squared, i running from 1 to n. What we want to do is identify the parameters beta such that dou L by dou beta is equal to 0; that is, we want to minimize L. What is L? It is the sum of the squares of the errors. And what is the error? This is very important: the error is y minus X beta, because we wrote the model y is equal to X beta plus epsilon. Sometimes you have positive errors, sometimes negative, and we do not want the positive and negative errors to add up and show us a small net error or zero error. So we square the errors: whether an error is positive or negative, its square is positive, and we get the complete total error. That gives us the sum of epsilon i squared, written in matrix notation as epsilon prime epsilon. Now we have the sum of the squared deviations and we want to minimize it. This is the good old least squares principle. You might have done this in higher secondary school or in the second year of your engineering program; the main idea is the same. In matrix notation we set dou L by dou beta equal to 0.
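This shape argument is easy to check numerically. A minimal sketch in Python with NumPy; the error values here are made up purely for illustration:

```python
import numpy as np

# A made-up vector of n = 4 errors, stored as an n x 1 column vector
eps = np.array([[1.0], [-2.0], [0.5], [3.0]])

inner = eps.T @ eps      # (1 x n)(n x 1) -> 1 x 1: the sum of squared errors
outer = eps @ eps.T      # (n x 1)(1 x n) -> n x n: not what we want

sse = inner[0, 0]        # the scalar L = sum of epsilon_i squared
print(inner.shape, outer.shape, sse)   # (1, 1) (4, 4) 14.25
```

The top-left entry of the n by n product is epsilon 1 squared, but its off-diagonal entries are cross products like epsilon 1 times epsilon 2, which is why only the 1 by 1 form gives the sum of squares.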
Epsilon prime epsilon is a scalar, and after the dust has settled you have X prime X beta hat is equal to X prime y, or beta hat is equal to X prime X inverse X prime y. This is a very famous equation: you can estimate the parameters of your regression model by pre-multiplying X prime y with X prime X inverse, and that directly gives the set of parameters. I am not giving you the full proof; the second relation follows from the first. If I pre-multiply both sides by X prime X inverse, then X prime X inverse times X prime X becomes the identity matrix, and the inverse multiplying X prime y gives you beta hat directly. You do not have to do these calculations by hand; for large matrices, finding inverses can become cumbersome and error prone, so you can use mathematical software such as MATLAB for the matrix manipulations. Now let us look at the dimensions of these matrices. We have X prime X beta hat is equal to X prime y, and beta hat is equal to X prime X inverse times X prime y. Are the dimensions consistent? It is a good time to summarize the dimensions of the different matrices involved. The X matrix has n rows and p columns: n experimental observations and p regression coefficients, beta 0 and so on to beta k, which is k plus 1 parameters. If X is n by p, its transpose X prime is p by n. When I take the transpose I interchange the rows and columns: a matrix with a certain number of rows and columns has its rows converted into columns and its columns into rows. So if X is n by p, the transpose of X has dimensions p by n.
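As a sketch of how such software handles this, here is the normal-equations calculation in Python with NumPy; the data values are made up, with n = 5 observations, one regressor (k = 1, so p = 2), and a first column of ones for the intercept:

```python
import numpy as np

# Made-up data: n = 5 observations of one regressor x and the response y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

# Model matrix X (n x p): a column of ones for the intercept, then x
X = np.column_stack([np.ones_like(x), x])

# beta_hat = (X'X)^{-1} X'y, the least squares estimates
beta_hat = np.linalg.inv(X.T @ X) @ (X.T @ y)

# Numerically it is preferable to solve (X'X) beta_hat = X'y directly
beta_check = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)   # [intercept estimate, slope estimate]
```

In practice one would use np.linalg.lstsq or a statistics package rather than forming the inverse explicitly; the explicit inverse is shown only to mirror the equation on the slide.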
You had n rows and p columns originally in the X matrix; in the X prime matrix you have p rows and n columns because the rows and columns have interchanged. So when I multiply these two, the inner n's cancel out and X prime X is p by p. X prime is p by n, as we saw, and y is a column vector of n observations, so n by 1; therefore X prime y is p by 1, and beta hat is p by 1, a column vector of the p parameters arranged row-wise in one column. This should be capital Y, the vector of responses with n rows and 1 column; I will just make it capital Y. So the fitted regression model is given by y predicted equal to X beta hat. The difference between the actual observation and the predicted value given above is called a residual, and may be expressed in matrix notation as e is equal to y minus y hat. We have been using epsilon, and now I am using e; there is a reason for this change. Epsilon represents the true error, the random component of the experiments, the random experimental error. But e is a residual, and that residual may be due only to random error, or it may also include unexplained effects in the experiment because of inadequate modeling. If my model is not fully able to explain the variations in the experimental response, then that discrepancy cannot be dismissed as random error. So my residual possibly contains unexplained variability as well as experimental error, and that is why I am using e here. If all the possible variability has been accounted for in the model, then the residuals would be a true reflection of the random error component. So first we will calculate the error sum of squares. The moment we see the term sum of squares, we can guess that some analysis of variance is involved. So we have y minus y hat equal to e, the residual.
Then e prime is y prime minus y hat prime, and we have y prime X beta hat equal to beta hat prime X prime y; both are the same because both are scalars. It can easily be shown that y prime X beta hat has dimensions 1 by 1; that calculation is shown here, and the same holds for beta hat prime X prime y. The n's cancel out nicely, leaving 1 by 1. Now we can show elegantly that the error sum of squares is given by y prime y minus beta hat prime X prime y. This is very interesting; let us see the proof. Let me go to the board and do the derivation directly, so that those not familiar with these steps can follow. We have (y minus X beta hat) prime times (y minus X beta hat). Taking the prime inside, and using the rule that (AB) prime equals B prime A prime, the first factor becomes y prime minus beta hat prime X prime, multiplying y minus X beta hat. Expanding, this is y prime y minus y prime X beta hat minus beta hat prime X prime y plus beta hat prime X prime X beta hat. We saw on the previous slide that y prime X beta hat equals beta hat prime X prime y, so the two middle terms combine into minus 2 beta hat prime X prime y, giving y prime y minus 2 beta hat prime X prime y plus beta hat prime X prime X beta hat. We also know from the normal equations that X prime X beta hat equals X prime y; you may recollect that we found the parameters beta hat by pre-multiplying X prime y with X prime X inverse. Using that relation, the last term becomes beta hat prime X prime y, and combined with minus 2 beta hat prime X prime y you get y prime y minus beta hat prime X prime y. This completes the derivation.
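This identity is easy to verify numerically; a small sketch with made-up data, using the same conventions as before:

```python
import numpy as np

# Made-up data: n = 5, p = 2 (intercept plus one regressor)
X = np.column_stack([np.ones(5), np.array([1.0, 2.0, 3.0, 4.0, 5.0])])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # normal equations
e = y - X @ beta_hat                           # residual vector

sse_direct = e @ e                             # e'e, the residual sum of squares
sse_formula = y @ y - beta_hat @ (X.T @ y)     # y'y - beta_hat' X'y
print(sse_direct, sse_formula)                 # the two agree
```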
It is always good to go to the board and do the derivations. PowerPoint slides have their own charm and make teaching more convenient; on the other hand, when writing on the board you make mistakes, correct them, and learn in the process. I hope you will also feel interested to do the derivations independently on paper and see whether your results match the final expected results. Now we are going to talk about a very important property in linear regression analysis: the variance covariance matrix. We have found the parameters beta 0, beta 1, and so on to beta k, and we want to know how precise these parameter estimates are. To get an idea of that, we can use the variance covariance matrix. Let me introduce the matrix first, and then we will talk about how to apply it in real regression problems. The variances of the least squares estimators come from the elements of the X prime X inverse matrix multiplied by the error variance sigma squared. Where did we see sigma squared previously? I showed a figure with the true relationship line and the experimental data scattered around it. We said the scatter was described by a probability distribution, a normal distribution, whose mean was given by the equation; the data did not sit exactly at that mean value but somewhere around it, because of random effects. That probability distribution had a variance sigma squared, and all the data points scattered around the true line had the same variance sigma squared. Let me go to that figure again. Here you have the true line and the data scattered around it, and the scatter is because of experimental error.
The scatter is described in terms of probability distributions, most conveniently normal distributions, whose means are given by the equation. One mean differs from another because one x value differs from another: the distribution has mean beta 0 plus beta 1 x, which changes with x, but it has a constant variance sigma squared. This is the sigma squared we are going to use in the variance covariance matrix. So, coming back to the variance covariance matrix: we have X prime X inverse multiplied by sigma squared, where sigma squared is the constant error variance. Unfortunately, we do not know sigma squared. We need to estimate the parameters beta 0, beta 1, and so on to beta k, and we also need to estimate the error variance sigma squared. The term variance covariance matrix implies that the matrix contains both variances and covariances. Which are the variances and which are the covariances? The diagonal elements of the X prime X inverse sigma squared matrix represent the variances: the element in the first row and first column, the second row and second column, the third row and third column, and so on. So the main diagonal comprises the variances of the estimated parameters, and the off-diagonal elements represent the covariances between the parameters. Since we are just now introducing the variance covariance matrix, it is enough at this point to understand that the variances are given by the diagonal terms and the covariances by the off-diagonal terms. It is also important to note that the variance covariance matrix is symmetric. What is a symmetric matrix?
A symmetric matrix is one whose appearance is unchanged when you change rows into columns and columns into rows. It is also worth noting that the variance covariance matrix is often simply called the covariance matrix, and sometimes people call it the variance matrix; do not get confused, they are all the same. The covariance matrix has dimensions p by p. So we have come to the form of the variance covariance matrix, represented by C; it is written bold because it is a matrix and not a scalar, and it equals X prime X inverse times sigma squared. Let us represent the elements of X prime X inverse, for a case with three parameters, as C 0 0, C 0 1, C 0 2 in the first row, C 1 0, C 1 1, C 1 2 in the second row, and C 2 0, C 2 1, C 2 2 in the third. C 0 0, C 1 1, and C 2 2 are the diagonal elements of this matrix, and all the other elements are off-diagonal. Sigma squared sits outside, but you may as well take it inside and multiply every element by sigma squared. This is a symmetric matrix: if I change rows into columns and columns into rows, its appearance is unchanged, which means C 0 1 equals C 1 0, C 0 2 equals C 2 0, and C 1 2 equals C 2 1. That is what I have represented here.
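The symmetry and the p by p dimensions can be illustrated with a short sketch; the design matrix below and the value of sigma squared are made up purely for illustration:

```python
import numpy as np

# Made-up design matrix: n = 5 runs, p = 3 (intercept plus two regressors)
X = np.column_stack([np.ones(5),
                     np.array([1.0, 2.0, 3.0, 4.0, 5.0]),
                     np.array([2.0, 1.0, 4.0, 3.0, 5.0])])

C = np.linalg.inv(X.T @ X)   # the (X'X)^{-1} matrix, with elements C_ij
sigma2 = 0.25                # an assumed error variance, for illustration only
cov = sigma2 * C             # variance-covariance matrix, p x p

var_beta = np.diag(cov)      # diagonal: variances of the beta_hat_j
print(cov.shape)             # (3, 3)
```

X prime X is symmetric, and the inverse of a symmetric matrix is again symmetric, which is why C 0 1 equals C 1 0 and so on.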
C 1 0 equals C 0 1, C 2 0 equals C 0 2, and C 2 1 equals C 1 2 in this X prime X inverse matrix; it is symmetric. The variance of an estimated parameter beta hat j is given by the diagonal term sigma squared C jj, where j is 0, 1, or 2, and the covariance between two different parameters beta i hat and beta j hat is given by the off-diagonal terms, sigma squared C ij with i not equal to j. There is obviously a typo here on the slide, which I will correct. As I said earlier, we do not know the value of sigma squared, and hence we need an estimate of the error variance; we also have to make sure that this estimate reflects only the random error and does not carry any systematic influences. So we want to replace sigma squared by an estimate, which we call sigma hat squared. In the true regression model we had y equal to X beta plus error; that beta is the column vector of true parameters describing the phenomena investigated by the experiments. Since we do the parameter estimation based on the available experimental data, the influence of errors is also present; our model accounts only for the systematic phenomena, with the random part handled separately as the error term, and hence we get only estimates of beta, given by beta hat. Even though we would like beta hat to be as close as possible to beta, we may not achieve that aim, because the data are subject to experimental uncertainty that is not included in our regression model. Similarly, since the true value of sigma squared is not known, we need an estimate to use in our calculations, and we call it sigma hat squared. How do we find it?
We will see shortly. Once you are able to find sigma hat squared, the square root of the estimated variance of the jth regression coefficient is called the estimated standard error of the least squares estimator beta hat j. Note that we do not use the term standard deviation here: we take the square root of the estimated variance of the jth regression coefficient and call it the estimated standard error. It is written se of beta hat j, where se stands for standard error, and is given by the square root of sigma hat squared times C jj. The j here matches the j of the diagonal element: we take the diagonal element of the variance covariance matrix, multiply it by the estimated error variance, and take the square root to get the standard error of beta hat j. How did we get beta hat j? We adopted the least squares method to find this parameter, and hence it is called the least squares estimator of beta j. The standard errors are a measure of the precision of the estimated regression coefficients: small standard errors imply good precision. They measure the fuzziness associated with the estimates. If the fuzziness is too large, there is a big spread around the estimated beta hat; if the spread given by the standard error is narrow, then we have estimated the beta hats with good, or at least reasonable, precision. For analysis of variance purposes, an estimate of the residual error is required, and we also need it to get an estimate of sigma squared. The residuals should ideally reflect differences due to random factors, not systematic discrepancies created by using an inadequate model.
The residual mean square is then an unbiased estimator of sigma squared. So we define the estimated error variance as sigma hat squared equal to the sum over i from 1 to n of e i squared, divided by n minus p; that is, the error sum of squares divided by n minus p. Each residual is the difference between the actual experimental value and the predicted value: in column vector form, e equals y minus y hat, the experimental vector minus the model-predicted vector, with entries e i equal to y i minus y hat i for the ith residual, i running from 1 to n. The error sum of squares is the sum from i equal to 1 to n of (y i minus y hat i) squared. If you look at the Kutner et al. (2004) reference, the variance covariance matrix can be defined as the expected value of (beta hat minus the expected value of beta hat) multiplied by the transpose of (beta hat minus the expected value of beta hat). We have to expand this and express it in matrix form. An important thing to realize is that the expected value of beta hat is the true parameter beta itself: beta hat is an unbiased estimator. So we have the column vector with entries beta 0 hat minus beta 0, beta 1 hat minus beta 1, beta 2 hat minus beta 2, which has 3 rows and 1 column, multiplying its transpose, which has 1 row and 3 columns with the same entries: beta 0 hat minus beta 0, then beta 1 hat minus beta 1, then beta 2 hat minus beta 2. We can first multiply these out and then take the expected value, denoted by E.
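This definition can also be checked by simulation: if we repeat the experiment many times with a known beta and sigma, the sample covariance of the beta hats should approach sigma squared times X prime X inverse. A sketch with made-up true values (the design, parameters, and sigma are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up design, true parameters, and true error standard deviation
X = np.column_stack([np.ones(20), np.linspace(0.0, 1.0, 20)])
beta_true = np.array([1.0, 2.0])
sigma = 0.5

# Simulate many repeated experiments; estimate beta each time
estimates = np.array([
    np.linalg.solve(X.T @ X, X.T @ (X @ beta_true + sigma * rng.standard_normal(20)))
    for _ in range(20000)
])

empirical_cov = np.cov(estimates, rowvar=False)       # sample covariance of beta_hat
theoretical_cov = sigma**2 * np.linalg.inv(X.T @ X)   # sigma^2 (X'X)^{-1}
print(empirical_cov)
print(theoretical_cov)
```

The two matrices agree to within sampling error, and the average of the estimates is close to the true beta, illustrating the unbiasedness mentioned above.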
When we multiply out this column vector with its transpose, we get a symmetric matrix: the expected values of the diagonal terms become the variances, and the expected values of the off-diagonal terms become the covariances. Let us do the multiplication: (beta 0 hat minus beta 0) times (beta 0 hat minus beta 0) is the first entry, (beta 0 hat minus beta 0) times (beta 1 hat minus beta 1) the next, and so on. You may want to do the calculations on your own to see whether you get this particular form. The (1, 2) term equals the (2, 1) term, and likewise the (3, 2) term, the third row second column element, equals the (2, 3) term, the second row third column element; the same can be shown for the other pairs, so we can conclude that the matrix inside the expectation is symmetric. Then we apply the expectation to all the elements, and the expected value of (beta 0 hat minus beta 0) squared is nothing but the variance of beta 0 hat; that is quite straightforward. The off-diagonal terms, such as the expected value of (beta 0 hat minus beta 0) times (beta 1 hat minus beta 1), give the covariance between beta 0 hat and beta 1 hat. To find sigma hat squared, we take the sum over i from 1 to n of e i squared, divided by n minus p, which is the error sum of squares divided by n minus p. Since this is an estimate of the error variance sigma squared, we denote it sigma hat squared. So that is accounted for.
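Putting the last few steps together in code: the sketch below (made-up data again, same conventions as the earlier examples) computes sigma hat squared as SSE over n minus p, and then the standard errors as the square root of sigma hat squared times C jj:

```python
import numpy as np

# Made-up data: n = 5 observations, p = 2 parameters
X = np.column_stack([np.ones(5), np.array([1.0, 2.0, 3.0, 4.0, 5.0])])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])
n, p = X.shape

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat                        # residuals
sigma2_hat = (e @ e) / (n - p)              # SSE / (n - p), estimate of sigma^2

C = np.linalg.inv(X.T @ X)
se = np.sqrt(sigma2_hat * np.diag(C))       # standard errors of beta_hat_j
print(sigma2_hat, se)
```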
Then we take that value sigma hat squared, multiply all the elements of X prime X inverse by it, and read off the different variances and covariances of the parameters and their combinations: the first diagonal element times sigma hat squared is the variance of beta 0 hat, the next is the variance of beta 1 hat, and so on. Now we are going to talk about the regression sum of squares and the error sum of squares. What we do here is express the deviation between an actual response and the predicted response in terms of two deviations: the deviation of the actual response from the mean response, and the deviation of the predicted response from the mean response. Rather than putting it in words, let us see it in symbols. You conduct all n experiments and take the average value y bar of the responses. The deviation of a particular response from this average is y i minus y bar, which may be written as (y i minus y hat i) plus (y hat i minus y bar); the y hat i terms cancel, leaving y i minus y bar. Equivalently, we can write the residual, the deviation of the actual data point from the predicted value, as y i minus y hat i equal to (y i minus y bar) minus (y hat i minus y bar). So we are expressing the residual as the difference of two entities: the deviation of the experimental observation from the mean value, and the deviation of the predicted value from the mean value. This is very interesting: instead of writing y i minus y hat i directly, we subtract and add y bar in the expression. The residual, the discrepancy between the actual experimental value and the predicted value, is expressed as the difference between y i minus y bar and y hat i minus y bar.
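The term-by-term identity, and the fact that for a least squares fit with an intercept the corresponding sums of squares also add up (total equals regression plus error, which the analysis of variance will use), can be checked with a short sketch on made-up data:

```python
import numpy as np

# Made-up data; least squares fit with an intercept term
X = np.column_stack([np.ones(5), np.array([1.0, 2.0, 3.0, 4.0, 5.0])])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
y_bar = y.mean()

# Term by term: (y_i - y_bar) = (y_i - y_hat_i) + (y_hat_i - y_bar)
lhs = y - y_bar
rhs = (y - y_hat) + (y_hat - y_bar)

# The sums of squares split the same way when an intercept is fitted
sst = np.sum((y - y_bar) ** 2)      # total sum of squares
sse = np.sum((y - y_hat) ** 2)      # error sum of squares
ssr = np.sum((y_hat - y_bar) ** 2)  # regression sum of squares
print(sst, sse + ssr)
```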
So, the discrepancy of y i with respect to the mean value, minus the discrepancy between the predicted value and the average value y bar, gives the residual. That can be shown graphically in a nice fashion: here is the actual experimental data point y i, slightly off from the predicted value y hat i, and here is the average value y bar based on all the experimental data points. The residual, the deviation between y i and y hat i, may be expressed as the deviation of y i with respect to y bar minus the deviation of y hat i with respect to y bar. Next we go to hypothesis testing in linear regression. We are starting a new phase of the regression analysis, and it is interesting to note that whatever we studied in the first part of Statistics for Experimentalists comes into play in this second phase as well. With your background in inferential statistics, you will be able to appreciate what we are going to do with linear regression, and you will also understand why we perform these kinds of tests. The hypothesis tests are what we are going to study in detail in the next lecture. I request you not only to brush up your fundamentals in linear algebra but also to revisit the concepts we covered in hypothesis testing: find out what is meant by the level of significance, the p value, the regions of acceptance and rejection, and the confidence intervals. Once you have refreshed these topics, whatever we discuss next will become very simple. Thank you for your attention.