The problem of multicollinearity exists when two or more regressor variables are dependent, that is, when two or more random regressor variables are linearly dependent. In the last class we talked about different techniques to detect multicollinearity: examination of the correlation matrix, and eigensystem analysis of the X'X matrix, that is, of the correlation matrix.

First I will recall the eigensystem analysis we learnt in the previous class. What we do is compute the k - 1 eigenvalues of the X'X matrix and then compute the condition number, denoted by κ = λ_max / λ_min. A large value of κ indicates a severe problem with multicollinearity. As a general rule, and this is what we talked about in the last class, if κ < 100 then there is no serious problem with multicollinearity; if κ is between 100 and 1000 then there is moderate to strong multicollinearity; but if κ > 1000 then there is a severe problem with multicollinearity. So this is how the condition number is used to detect multicollinearity.

The advantage of this eigensystem analysis is that it not only detects multicollinearity, it can also measure the number of linear dependences in the correlation matrix and identify the nature of the linear dependences between the regressors. For that we compute the condition indices. The condition index κ_j associated with the j-th regressor is κ_j = λ_max / λ_j, and the number of κ_j greater than 1000 is a useful measure of the number of linear dependences in X'X.

I illustrated this result using the Webster data, which I talked about in the previous class; it has six regressors and one response variable. The eigenvalues λ_1, λ_2, ..., λ_6 of the X'X matrix (the correlation matrix) for the Webster data include λ_1 = 2.4288 and λ_2 = 1.5462, and the smallest eigenvalue is λ_6 = 0.0011, which is close to 0. The condition number is κ = 2188, which is greater than 1000, so it indicates the presence of severe multicollinearity in the Webster data.

Next, since we know the eigenvalues, we can also compute the condition indices κ_j = λ_max / λ_j. So κ_1 = 2.4288 / 2.4288 = 1, κ_2 = 2.4288 / 1.5462 = 1.57, κ_3 = 2.4288 / 0.921 = 2.633, κ_4 = 3.05, κ_5 = 7.88, and κ_6 = 2188. (I made a small slip on the slide and wrote two of these indices in the wrong order; the value 7.88 belongs to κ_5, not κ_3.) Now, only one condition index exceeds 1000, namely κ_6.
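To make the computation concrete, here is a minimal Python sketch, assuming you already have the correlation matrix X'X of the regressors; the small 3 × 3 matrix in the code is hypothetical illustration data, not the Webster matrix.

```python
import numpy as np

# A minimal sketch of the eigensystem analysis, assuming XtX is the
# (k-1) x (k-1) correlation matrix of the regressors. This 3 x 3 matrix
# is hypothetical illustration data, not the Webster data.
XtX = np.array([[1.00, 0.85, 0.80],
                [0.85, 1.00, 0.90],
                [0.80, 0.90, 1.00]])

eigenvalues = np.linalg.eigvalsh(XtX)            # eigenvalues of the symmetric matrix
kappa = eigenvalues.max() / eigenvalues.min()    # condition number
condition_indices = eigenvalues.max() / eigenvalues

print("condition number:", kappa)
print("condition indices:", condition_indices)
# The number of condition indices > 1000 measures the number of linear dependences.
print("number of linear dependences:", int(np.sum(condition_indices > 1000)))
```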
So we conclude that there is only one linear dependence in the data, because, as I mentioned before, the number of κ_j greater than 1000 measures the number of linear dependences in the data, and since here only one κ_j, namely κ_6, is greater than 1000, the number of linear dependences is equal to 1.

Now, this technique has a lot of advantages: it not only detects the presence of multicollinearity, it can measure the number of linear dependences in the data and can also identify the nature of those dependences. So let me explain that portion now: how eigensystem analysis can be used to identify the nature of the linear dependences in the data. The correlation matrix X'X may be decomposed as X'X = T D T', where D = diag(λ_1, λ_2, ..., λ_{k-1}) is a diagonal matrix whose main diagonal elements are the eigenvalues, and T is a (k-1) × (k-1) matrix whose columns t_1, t_2, ..., t_{k-1} are the eigenvectors; t_i = (a_1, a_2, ..., a_{k-1})' is the eigenvector associated with the eigenvalue λ_i. So this is the decomposition of the correlation matrix.

Now, if an eigenvalue λ_i is close to 0, the elements of the associated eigenvector t_i describe the nature of a linear dependence. The nature of this linear dependence is a_1 x_1 + a_2 x_2 + ... + a_{k-1} x_{k-1} = 0, where the coefficients a_1, a_2, ..., a_{k-1} of the regressor variables are precisely the elements of t_i. Let me give a little motivation behind this: if λ_i is close to 0, then the condition index associated with λ_i is large (greater than 1000), and you get one linear dependence between the regressor variables associated with λ_i. That is why, as I mentioned before, the number of condition indices greater than 1000 measures the number of linear dependences in the data, and corresponding to each λ_i for which the condition index is greater than 1000 you can identify one linear dependence between the regressors.

Let me illustrate this using the Webster data. Here the smallest eigenvalue is λ_6 = 0.0011, and the associated eigenvector is t_6 = (-0.447, -0.421, -0.541, -0.573, -0.006, -0.002)'. So the nature of the linear dependence is -0.447 x_1 - 0.421 x_2 - 0.541 x_3 - 0.573 x_4 - 0.006 x_5 - 0.002 x_6 = 0. Since the last two coefficients are very small, we can ignore x_5 and x_6 in this equation, and this implies x_1 = -0.941 x_2 - 1.21 x_3 - 1.28 x_4. So this is the linear dependence between the regressors x_1, x_2, x_3 and x_4, and this linear dependence is associated with λ_6.
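Here is a short sketch, in the same hypothetical 3 × 3 setting as above, of how one might extract the eigenvector of the smallest eigenvalue and read off the implied dependence:

```python
import numpy as np

# Reading off the nature of a linear dependence from the eigenvector of a
# near-zero eigenvalue. XtX is the same hypothetical 3 x 3 correlation
# matrix as before, not the Webster matrix.
XtX = np.array([[1.00, 0.85, 0.80],
                [0.85, 1.00, 0.90],
                [0.80, 0.90, 1.00]])

eigenvalues, T = np.linalg.eigh(XtX)  # columns of T are the eigenvectors t_i
a = T[:, np.argmin(eigenvalues)]      # eigenvector for the smallest eigenvalue

# The elements a_1, ..., a_{k-1} give the approximate dependence
# a_1*x_1 + a_2*x_2 + ... + a_{k-1}*x_{k-1} = 0.
print("dependence:", " + ".join(f"({c:+.3f})*x{j+1}" for j, c in enumerate(a)), "= 0")

# Solving for x_1 (dividing through by -a_1), as we did for the Webster data:
print("x1 =", " ".join(f"{-a[j]/a[0]:+.3f}*x{j+1}" for j in range(1, len(a))))
```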
And if you have more eigenvalues which are close to 0, then corresponding to each such λ_i you will get a linear dependence like this. So what we learn from this eigensystem analysis is that it can detect the presence of multicollinearity, it can measure the number of linear dependences in the data, and it can also identify the nature of the linear dependences in the data.

Next we move to another way to detect the presence of multicollinearity, called variance inflation factors. First we recall the variance of the i-th regression coefficient, by which I mean the variance of the least squares estimate β̂_i of the i-th regression coefficient. We know that Var(β̂_i) = σ² (X'X)^{-1}_{ii}, that is, σ² times the (i, i)-th element of (X'X)^{-1}. Now, it can be proved that, with the regressors in correlation form, this (i, i)-th element is equal to 1 / (1 - R_i²), so Var(β̂_i) = σ² · 1 / (1 - R_i²). What is R_i²? It is the coefficient of multiple determination when x_i is regressed on the remaining regressors.

Now, if the i-th regressor x_i is nearly orthogonal to the remaining regressors (by nearly orthogonal I mean that x_i is independent of the remaining regressors), then there is no linear dependence associated with x_i; x_i cannot be represented as a linear combination of the other regressors. In that case the coefficient of multiple determination when you regress x_i on the remaining regressors will be small, 1 / (1 - R_i²) will be close to unity, and the variance of β̂_i will be close to σ².

On the other hand, if x_i is nearly linearly dependent on some subset of the remaining regressors, meaning x_i can be represented as a linear combination of some subset of the remaining regressors, then R_i² will be large, close to unity, and 1 / (1 - R_i²) will be large. So ultimately, if there is such a linear dependence among the regressors, the variance of β̂_i is going to be large because this factor is large. So let me write it here: Var(β̂_i) = σ² · 1 / (1 - R_i²), and this factor 1 / (1 - R_i²) can be viewed as the factor by which the variance of β̂_i is increased due to the linear dependence among the regressors.
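A minimal Python sketch of this computation follows: each scaled column is regressed on the others to get R_i² and the factor 1 / (1 - R_i²), which is exactly the variance inflation factor defined next. The simulated data, with x3 built as nearly x1 + x2, is hypothetical.

```python
import numpy as np

def vif(X):
    """Regress each centered, scaled column on the remaining columns and
    return 1 / (1 - R_i^2) for each column i."""
    n, p = X.shape
    Z = (X - X.mean(axis=0)) / X.std(axis=0)       # correlation-form scaling
    factors = []
    for i in range(p):
        y = Z[:, i]
        A = np.column_stack([np.ones(n), np.delete(Z, i, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - (resid @ resid) / (y @ y)       # y is centered, so TSS = y'y
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

# Hypothetical data with x3 nearly equal to x1 + x2, so the factors blow up.
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=50), rng.normal(size=50)
x3 = x1 + x2 + rng.normal(scale=0.01, size=50)
print(vif(np.column_stack([x1, x2, x3])))
```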
Well, so if there is a linear dependence between x_i and a subset of the remaining regressors, then the value of this factor is large and the variance of β̂_i is also large. And if x_i is independent of the remaining regressors, or as I said, nearly orthogonal to them, then this factor is close to 1 and the variance of β̂_i is almost equal to σ².

So the variance inflation factor associated with the regressor x_i is defined by VIF_i = 1 / (1 - R_i²), and obviously a large value of VIF_i indicates possible multicollinearity associated with x_i. The meaning is that only if there is a linear dependence between x_i and a subset of the remaining regressors will this value be large; so a large value of VIF_i indicates possible multicollinearity associated with the regressor x_i. In general, VIF_i ≥ 5 indicates possible multicollinearity, and VIF_i ≥ 10 indicates that multicollinearity is almost certainly present. So this is how the variance inflation factor associated with the i-th regressor can be used to detect multicollinearity. This technique can detect the problem of multicollinearity, but of course it cannot identify the nature of the multicollinearity. In that respect eigensystem analysis is a better technique, because it can detect multicollinearity, measure the number of linear dependences in the data, and also identify the nature of those dependences.

Next, once you detect that there is multicollinearity in the data, how do you deal with it? We will be talking about several techniques to deal with multicollinearity.

The first technique is to collect additional data; collecting additional data has been suggested as the best method of dealing with multicollinearity. Let me illustrate this a little. Suppose you have a multiple linear regression model with only two regressors x_1 and x_2 and the response variable y, and you have n data points. Now, you have detected that multicollinearity exists in this data; since there are only two regressors here, that means x_1 and x_2 are linearly dependent, and the multicollinearity exists because of that linear dependence. So what we do is collect some more data, say another m data points, to break the multicollinearity existing in the present data. This additional data should be collected in a manner designed to break up the existing multicollinearity: initially you have n data points in which x_1 and x_2 are linearly dependent, and you collect another m data points in such a way that when you combine the complete set of n + m data points, x_1 and x_2 are no longer linearly dependent. A simulated sketch of this idea follows.
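In this sketch, with made-up numbers, the original n points have x_2 almost exactly equal to 2 x_1, and the m additional points are collected so that x_1 and x_2 vary independently, which brings the condition number down:

```python
import numpy as np

# Hypothetical illustration: n = 30 original points with x2 ~ 2*x1 (strong
# collinearity), then m = 10 new points where x1 and x2 vary independently.
rng = np.random.default_rng(1)

def condition_number(X):
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # correlation-form scaling
    lam = np.linalg.eigvalsh(Z.T @ Z / len(Z)) # eigenvalues of the correlation matrix
    return lam.max() / lam.min()

x1 = rng.normal(size=30)
X_old = np.column_stack([x1, 2 * x1 + rng.normal(scale=0.01, size=30)])
print("before:", condition_number(X_old))      # very large

X_new = rng.normal(size=(10, 2))               # m additional, independent points
print("after: ", condition_number(np.vstack([X_old, X_new])))  # much smaller
```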
You have to collect the data in a manner that breaks up the multicollinearity in the existing data. So this is one way to deal with multicollinearity, but in many instances it is not possible in practice.

The next technique to deal with multicollinearity is to remove regressors from the model. If two regressors are linearly dependent, it means they contain redundant information. For example, if x_1 and x_2 are linearly dependent, say x_1 = 2 x_2, then the information regarding x_2 is redundant. Thus we can pick one regressor to keep in the model and discard the other one, for example removing x_2 from the model. Similarly, if x_1, x_2 and x_3 are linearly dependent, then eliminating one regressor variable, say x_3, may help to reduce the effect of multicollinearity. But the problem is this: it might happen that the regressor you have removed, x_3, is significant for explaining the variability in the response variable. In that case removing a regressor may damage the predictive power of the model. That is why we say that eliminating regressors to reduce multicollinearity may damage the predictive power of the model.

The other technique is to collapse variables. If there is a linear dependence between two or more regressors, you combine the linearly dependent variables into a single composite variable; that composite regressor then replaces them in the model.

So these are the techniques to deal with multicollinearity: collecting more data points, removing a regressor from the model, and combining the regressors which are linearly dependent.

In this module we have learnt what multicollinearity is: it is the name of a problem in the multiple regression model, and it arises when two or more regressor variables are linearly dependent. We first learnt what problems arise due to multicollinearity, then we learnt how to detect multicollinearity if it exists in the data, and finally we learnt how to deal with multicollinearity. So that is all for today. Thank you for your attention.