Welcome to this lecture on multicollinearity. In the previous lecture we learnt what multicollinearity is: the problem of multicollinearity arises when two or more regressor variables are linearly dependent. In the last class we also saw that the presence of multicollinearity has a severe effect on the least squares estimates of the regression coefficients. Today we will talk about some more effects of multicollinearity, that is, problems due to multicollinearity, and we will also learn how to detect the presence of multicollinearity in the data. So, let me first recall the problem we discussed last time: strong multicollinearity results in large variances and covariances of the regression coefficients. We illustrated this in the case of a multiple linear regression model with two regressors, and we also proved the fact in general.

The second problem is that multicollinearity tends to produce least squares estimates $\hat{\beta}$ that are too far from the true parameter $\beta$. To see this, we compute the squared distance from $\hat{\beta}$ to the true parameter value $\beta$,

$$L^2 = \sum_{i=1}^{k-1} \left(\hat{\beta}_i - \beta_i\right)^2,$$

where $\hat{\beta}_i$ is the least squares estimate of the $i$th regression coefficient and $\beta_i$ is its true value. Next we compute the expected value of this squared distance,

$$E(L^2) = \sum_i E\left(\hat{\beta}_i - \beta_i\right)^2.$$

Now, $\hat{\beta}_i$ is the least squares estimate of $\beta_i$, and we know that the least squares estimates are unbiased, so $E(\hat{\beta}_i) = \beta_i$. Hence each term is $E\{\hat{\beta}_i - E(\hat{\beta}_i)\}^2$, which is nothing but $\operatorname{Var}(\hat{\beta}_i)$. In the multiple linear regression model, $\operatorname{Var}(\hat{\beta}_i) = \sigma^2 (X'X)^{-1}_{ii}$, the $i$th diagonal element of $\sigma^2 (X'X)^{-1}$, and, as we proved before, this element equals $\sigma^2/(1 - R_i^2)$. Therefore

$$E(L^2) = \sigma^2 \sum_i \frac{1}{1 - R_i^2},$$

where $R_i^2$ is the coefficient of multiple determination for the regression of $x_i$ on the remaining $k-2$ regressors. When there is multicollinearity, $1/(1 - R_i^2)$ will be large for at least one $i$. Remember, the problem of multicollinearity arises when two or more regressors are linearly dependent.
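This inflation of $E(L^2)$ is easy to check numerically. Here is a minimal simulation sketch (the sample size, true coefficients, and correlation levels are my own illustrative choices, not from the lecture): it estimates the expected squared distance by repeated sampling, once with nearly orthogonal regressors and once with strongly correlated ones.

```python
import numpy as np

# Illustrative simulation: average ||beta_hat - beta||^2 under weak versus
# strong correlation between two regressors (settings are assumptions).
rng = np.random.default_rng(0)
n, sigma = 50, 1.0
beta = np.array([1.0, 2.0, -1.0])               # intercept, beta1, beta2

def mean_sq_distance(rho, reps=2000):
    """Estimate E[L^2] over repeated samples with corr(x1, x2) = rho."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    total = 0.0
    for _ in range(reps):
        X = rng.multivariate_normal(np.zeros(2), cov, size=n)
        X1 = np.column_stack([np.ones(n), X])   # add intercept column
        y = X1 @ beta + rng.normal(0.0, sigma, size=n)
        beta_hat = np.linalg.lstsq(X1, y, rcond=None)[0]
        total += np.sum((beta_hat - beta) ** 2)
    return total / reps

print("E[L^2], rho = 0.10:", mean_sq_distance(0.10))
print("E[L^2], rho = 0.99:", mean_sq_distance(0.99))  # much larger distance
```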
Now if, say, the $i$th regressor $x_i$ is linearly dependent on the remaining regressors, then $R_i^2$, the coefficient of multiple determination associated with $x_i$, is close to unity; and as $R_i^2 \to 1$, $1/(1 - R_i^2) \to \infty$, so the expected squared distance becomes large. That is why we say that under multicollinearity this term will be large for at least one $i$.

The next problem due to multicollinearity is a model coefficient with a negative sign when a positive sign is expected. That is, you may get a negative sign for some regression coefficient when you are really expecting a positive sign; this might be an effect of multicollinearity.

The fourth issue is high significance in the global F test while none of the regressors is significant in the partial F tests. I would like to illustrate this point by recalling an example from module 2, on the multiple linear regression model. In that example we have two regressors and one response variable, and we have the data for them. We know how to fit a multiple linear regression model here; if you can recall, the fitted model is $\hat{y} = 14 - 2x_1 - x_2/2$. Once we have the fitted model, we check its significance using the global test: we test the null hypothesis $H_0: \beta_1 = \beta_2 = 0$ against the alternative that $H_0$ is not true. The null hypothesis says that there is no linear relationship between the response variable and the regressor variables, and we test it using the global F value obtained from the ANOVA table. Here is the ANOVA table for this data (you may refer to my classes in module 2). We had 11 observations, which is why the total degrees of freedom is 10, and the F value is 7.17. This statistic follows the F distribution with 2 and 8 degrees of freedom, and since the observed F value is greater than the tabulated F value, we reject the null hypothesis $\beta_1 = \beta_2 = 0$. That means we accept the alternative, namely $\beta_i \neq 0$ for at least one $i$. So the conclusion from this test is that the fitted model is significant: the global test rejects the hypothesis that there is no linear relationship between the response variable and the regressor variables.
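As a quick numerical check of this rejection, a one-line computation gives the tabulated F value (a sketch; the 5% significance level is my assumption, since the lecture does not state the level used):

```python
from scipy.stats import f

# Tabulated F value for the global test: upper 5% point (assumed level)
# of the F(2, 8) distribution.
print(f.ppf(0.95, dfn=2, dfd=8))   # approx. 4.46; observed F = 7.17 exceeds it
```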
So, we are rejecting that null hypothesis, which means we are accepting that there is a linear relationship between the response variable and the regressor variables. Now we go for the partial F tests. The first question is: what does $x_2$ contribute given that $x_1$ is already in the regression? Whether the contribution of $x_2$ in the presence of $x_1$ is significant can be tested by testing the hypothesis $H_0: \beta_2 = 0$ against $H_1: \beta_2 \neq 0$. This tests the significance of $x_2$ in the presence of $x_1$ in the regression model, and you can go for either the partial F test or the t test. Here I took the t statistic approach: the tabulated t value is 2.306, and the observed value is not greater than the tabulated value, so we accept the null hypothesis $\beta_2 = 0$. Accepting this null hypothesis means that the second regressor $x_2$ is not significant in the presence of $x_1$. Next we test the significance of $x_1$ in the presence of $x_2$: what does $x_1$ contribute given that $x_2$ is already in the regression? This is tested by testing $H_0: \beta_1 = 0$ against $H_1: \beta_1 \neq 0$. Here is the test statistic value and here is the tabulated value; again the observed value is not greater than the tabulated value, so we accept the null hypothesis $\beta_1 = 0$. So none of the partial tests, whether you use the partial F test or the t test, is significant. What we observed here is that the global F test is significant, saying that there is a linear relationship between the response variable and the regressors, but when we go for the partial tests, none of the regressors is significant: neither $x_2$ in the presence of $x_1$, nor $x_1$ in the presence of $x_2$. Normally, if the global F test is significant, at least one regressor should be significant; here the global F test is significant but none of the partial tests is, and this might be the effect of multicollinearity. So you can check whether multicollinearity exists in the given data, that is, whether $x_1$ and $x_2$ are linearly dependent. This is one example of the effect of multicollinearity in the data; the sketch below reproduces the same phenomenon on simulated data.
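Here is a minimal sketch of this discrepancy (using simulated data of my own, not the module 2 data set): it fits a two-regressor model with strongly correlated regressors using statsmodels and prints the global F p-value alongside the individual t-test p-values.

```python
import numpy as np
import statsmodels.api as sm

# Simulated example (illustrative, not the lecture's data): x2 is almost
# a copy of x1, so the two regressors are nearly linearly dependent.
rng = np.random.default_rng(1)
n = 30
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)        # strong collinearity
y = 3.0 + 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
res = sm.OLS(y, X).fit()

print("global F p-value:", res.f_pvalue)        # typically very small
print("t-test p-values :", res.pvalues[1:])     # often both above 0.05
```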
Next we move to another effect of multicollinearity: different model selection procedures may yield different models. We talked about selecting the best model in module 3; we know how to select the best model using all possible regressions and also stepwise selection. Of course, different model selection procedures can yield different models in general, but if multicollinearity is present in the data, then with high probability the different model selection procedures will yield different models. So these are the different problems that can occur due to multicollinearity. Next we will talk about the different techniques to detect multicollinearity.

The first technique is examination of the correlation matrix $X'X$. A simple measure of multicollinearity is inspection of the off-diagonal elements $r_{ij}$ of $X'X$. We know what the correlation matrix is: given the original data, if you center and scale the regressors, then $X'X$ for the transformed data is called the correlation matrix. The off-diagonal elements of the correlation matrix are the $r_{ij}$'s, where $r_{ij}$ is the sample correlation coefficient between the regressors $x_i$ and $x_j$. Now, if the regressors $x_i$ and $x_j$ are linearly dependent, then $|r_{ij}|$ will be near unity, so a high value of the correlation coefficient between $x_i$ and $x_j$ indicates the presence of multicollinearity. As a general rule, $|r_{ij}| > 0.9$ indicates a multicollinearity problem. Examining the correlation matrix $X'X$ is helpful in detecting linear dependence between pairs of regressors, but it is not helpful in detecting a multicollinearity problem arising from linear dependence among more than two regressors. Let me explain what I mean: if the multicollinearity is due to linear dependence between two regressors, then the correlation matrix can detect that. But we are in the multiple linear regression setup, so there are $k-1$ regressors, and $k-1$ could be greater than 2; it might be the case that the multicollinearity is due to a linear dependence involving more than two regressors, and in that case the correlation matrix cannot detect it. The small constructed sketch below makes this concrete.
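In this sketch (a constructed example of my own, not the data set discussed next) the third regressor is almost an exact linear combination of the first two, yet no pairwise correlation comes close to the 0.9 rule of thumb:

```python
import numpy as np

# Constructed illustration: x3 is approximately x1 + x2, a three-variable
# linear dependence that pairwise correlations fail to flag.
rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + rng.normal(scale=0.01, size=n)   # near-exact dependence

R = np.corrcoef(np.column_stack([x1, x2, x3]), rowvar=False)
print(np.round(R, 3))
# The off-diagonal entries are around 0.7 at most, below the 0.9 rule of
# thumb, even though x1, x2, x3 are almost exactly linearly dependent.
```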
Let me now give a real example to illustrate this fact. I will take a data set of unstandardized regressor and response variables from Webster, Gunst, and Mason; I will refer to it as the Webster data. Here we have six regressors and a response variable. What I want to show is that, as we will see later, this data set has a multicollinearity problem, but the correlation matrix cannot detect it, because the multicollinearity in the Webster data is not due to linear dependence between two regressors; here the linear dependence involves more than two regressors. So first we compute the correlation matrix for this data. Looking at the correlation matrix for the Webster data, you see that none of the off-diagonal elements is suspiciously large; this one here is perhaps the highest correlation value. Recall our rule: $|r_{ij}| > 0.9$ indicates a multicollinearity problem. Here none of the pairwise correlations is suspiciously large, so there is no indication of pairwise linear dependence, and inspection of the $r_{ij}$'s is not sufficient to detect the multicollinearity. The conclusion is that examining the correlation matrix is not sufficient to detect a multicollinearity problem that involves linear dependence among more than two regressors.

Next we talk about one more technique to detect multicollinearity: eigensystem analysis of $X'X$. Multicollinearity can also be detected from the eigenvalues of the correlation matrix. I hope you know what eigenvalues and their associated eigenvectors are. For a model with $k-1$ regressors, $X'X$ is a $(k-1) \times (k-1)$ matrix, so there will be $k-1$ eigenvalues, say $\lambda_1, \lambda_2, \ldots, \lambda_{k-1}$. Now, if there are one or more linear dependences in the data, then one or more eigenvalues will be small: small eigenvalues imply linear dependence among the columns of $X$. We define the condition number of $X'X$ as

$$\kappa = \frac{\lambda_{\max}}{\lambda_{\min}},$$

where $\lambda_{\max}$ is the maximum eigenvalue and $\lambda_{\min}$ is the minimum eigenvalue. If $\lambda_{\min}$ is small, very close to 0, then the condition number is going to be large; and since a small $\lambda_{\min}$ means there are linear dependences in the data, a large value of $\kappa$ indicates the presence of multicollinearity. As a general rule, $\kappa < 100$ indicates no serious problem with multicollinearity, $\kappa$ between 100 and 1000 indicates moderate to strong multicollinearity, and $\kappa > 1000$ indicates a severe problem with multicollinearity.
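Here is a minimal sketch of this computation (the regressor matrix X is a placeholder; I am assuming, as the lecture described, that the columns should be centered and scaled to unit length so that X'X is the correlation matrix). It computes the eigenvalues, the condition number, and the condition indices that we define next.

```python
import numpy as np

def condition_diagnostics(X):
    """Eigenvalue-based multicollinearity diagnostics for a regressor matrix X.

    Columns of X are centered and scaled to unit length first, so that
    Z'Z is the correlation matrix of the regressors.
    """
    Z = X - X.mean(axis=0)
    Z = Z / np.sqrt((Z ** 2).sum(axis=0))
    eigvals = np.linalg.eigvalsh(Z.T @ Z)       # eigenvalues, ascending order
    kappa = eigvals.max() / eigvals.min()       # condition number
    indices = eigvals.max() / eigvals           # condition indices kappa_j
    return eigvals, kappa, indices

# Hypothetical usage with some regressor matrix X (not the Webster data):
# eigvals, kappa, indices = condition_diagnostics(X)
# print(kappa, np.sum(indices > 1000))  # > 1000 suggests severe multicollinearity
```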
So, a large value of $\kappa$ indicates a severe problem with multicollinearity, because $\kappa$ will be large when $\lambda_{\min}$ is very small, that is, close to 0, and $\lambda_{\min}$ close to 0 indicates a linear dependence among the columns of $X$, which means the presence of multicollinearity. That is the condition number. Next we define the condition indices of the $X'X$ matrix as

$$\kappa_j = \frac{\lambda_{\max}}{\lambda_j}, \qquad j = 1, \ldots, k-1.$$

Clearly the largest condition index, obtained when $\lambda_j$ is the minimum eigenvalue, is nothing but the condition number. Now, this eigensystem analysis not only detects the multicollinearity problem, it can also measure the number of linear dependences in the data: the number of condition indices $\kappa_j$ greater than 1000 is a useful measure of the number of linear dependences in $X'X$.

If we consider the Webster data, where we had six regressors, there are six eigenvalues: $\lambda_1 = 2.4288$, $\lambda_2 = 1.5462$, $\lambda_3 = 0.921$, $\lambda_4 = 0.794$, $\lambda_5 = 0.3079$, $\lambda_6 = 0.0011$. The smallest eigenvalue is very close to 0, and there is only one such small eigenvalue. A small eigenvalue, one very close to 0, indicates a linear dependence in the data, so since there is only one small eigenvalue, there might be only one linear dependency in the data; of course, we have to check with the condition indices as well. Let me first compute the condition number: $\kappa = \lambda_{\max}/\lambda_{\min} = 2.4288/0.0011 \approx 2208$, which is larger than 1000 and hence indicates a severe multicollinearity problem.

So what I want to say today regarding the detection of multicollinearity is this: for the Webster data, examining the correlation matrix could not detect the problem of multicollinearity, whereas the eigensystem analysis says that, since the condition number is about 2208, much larger than 1000, there is a severe multicollinearity problem in the data. In the next class we will compute the condition indices, and from there we will see how many linear dependences there are. That is all for today. Thank you for your attention.