Hello everyone, welcome to this session on regression analysis. In the previous session we discussed multiple linear regression. In multiple linear regression, when you have many independent variables to regress the dependent variable on, we sometimes observe that some of those independent variables are interrelated, that is, correlated with each other. Whenever some independent (explanatory) variables in a multiple regression are correlated with each other, multicollinearity enters the regression. In this session we will discuss the concept of multicollinearity and how it can be handled.

As an example, your age and your experience may both explain your performance; but suppose we also include variables such as your height or your academic activities. Sometimes a variable turns out to be irrelevant, it does not explain the dependent variable at all, and we simply remove it. But sometimes variables are all relevant, yet linearly dependent on each other. In that case we need to decide which variable to remove and how to handle the multicollinearity. When multicollinearity is present, it is very difficult to estimate the regression coefficients reliably, the standard errors may be very high, and it is hard to defend the fitted regression as the best causal relationship between the independent variables and the dependent variable, because there is interdependency among the independent variables. Multicollinearity creates many issues in regression analysis, so to handle it you first need to measure its level.

For example, suppose you have a dependent variable y and independent variables x1 and x2, and you find a high correlation coefficient between x1 and x2; then multicollinearity is present. To measure its level there is a formula called the variance inflation factor (VIF): VIF = 1 / (1 - R²). Here R² is the coefficient of determination obtained by regressing one independent variable on the others; with only two predictors, the R² of x1 regressed on x2 is simply the square of their correlation. If you have many independent variables, you compute, for each one, the R² of its regression on all the other independent variables put together; note that only the independent variables enter this calculation, not y. Putting that R² into the formula gives the variance inflation factor for that variable. As an example, suppose the correlation between the independent variables x1 and x2 is quite high, so that the R² of one regressed on the other comes out to about 0.9.
In that case the variance inflation factor for that variable is VIF = 1 / (1 - R²) = 1 / (1 - 0.9) = 10, which is quite high.

There is a general protocol for interpreting the VIF. If a couple of independent variables are only mildly correlated, you may keep them both, fit the regression, and draw your conclusions, but the level of correlation should be very low; only a little multicollinearity can be accepted. This matters because some variables are genuinely important: people may ask what the impact of a particular variable is and why you did not include it, and you can defend the decision by pointing out that the independent variables were interrelated, so one of them was removed. I will discuss the handling part shortly, but since each independent variable has its own merit in the regression, we need a cutoff. Note first the extreme cases: if there is no relationship among the independent variables, R² is 0 and the VIF is 1, a clear-cut case of no multicollinearity; but as R² approaches 1, you can see from the formula that the VIF blows up. The usual rules of thumb are: a VIF below 5 indicates little multicollinearity, so you can accept it and proceed with the regression; a VIF between 5 and 10 indicates moderate multicollinearity, which you should handle; and a VIF above 10 indicates high multicollinearity, which must be dealt with. These are not hard benchmarks, but to a large extent people follow these rules. This is the general concept of the variance inflation factor, through which you measure the level of multicollinearity.

Suppose you have measured the level of multicollinearity; the next step is to deal with it. As I mentioned, it can have a big impact on your final decision making: the regression may not be acceptable, the standard errors will be inflated, and the least-squares coefficients can be quite absurd, so you cannot draw any conclusion from them. There are two popular approaches. One is to remove a variable: if two variables are saying the same thing and are correlated with each other, delete one of them from the regression and rerun it. The other is transformation: sometimes if you take a logarithm or apply some other transformation to the variables, the multicollinearity reduces.
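To make the arithmetic concrete, here is a minimal Python sketch of the VIF calculation and the rules of thumb above. (The lecture works in Excel; this code, including the function names and the example numbers, is purely illustrative.)

```python
def vif_from_r_squared(r_squared):
    """Variance inflation factor: VIF = 1 / (1 - R^2)."""
    return 1.0 / (1.0 - r_squared)

def interpret_vif(vif):
    """The usual rules of thumb discussed above."""
    if vif < 5:
        return "little multicollinearity: proceed with the regression"
    elif vif <= 10:
        return "moderate multicollinearity: should be handled"
    else:
        return "high multicollinearity: must be handled"

# No correlation at all: R^2 = 0, so VIF = 1 (its minimum value).
print(vif_from_r_squared(0.0))            # 1.0

# The example above: R^2 of about 0.9 between two predictors.
vif = vif_from_r_squared(0.9)
print(f"VIF = {vif:.1f} -> {interpret_vif(vif)}")
# VIF = 10.0 -> moderate multicollinearity: should be handled
```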
The best option is to remove one of two variables that are very strongly correlated and saying the same thing; I will show you an example shortly. Another option for handling multicollinearity is to increase the sample size. Suppose you have a data set with y and variables x1, x2, x3, and based on, say, 20 sample observations you fit the regression and find multicollinearity between x2 and x3. Option one is to delete one of them, say x3. Option two is to increase the sample size: collect more observations, and you may find that the multicollinearity no longer remains in the data set, or at least that its level, and the corresponding variance inflation factor, is reduced. Those are the two options. In practice, however, getting more samples may not be easy: the company may have already provided the data, or you already have a data set and must take a decision based on it. Increasing the sample size is a sound approach, but it is time consuming, since you have to collect data all over again. If you must decide based on the available historical data and you find strong multicollinearity, with a very high VIF, the best option is to remove one variable.

How do we do that? Let me illustrate with a basic example; you will get a clear idea about multicollinearity. Here we have taken two independent variables and one dependent variable, and we fit the regression in Excel the way I showed you for simple linear regression. The different measures of goodness of fit, the R² and standard error calculations, the summary sheet of the regression output, the ANOVA table, and the p-values were all covered in detail in the earlier sessions, so we will not repeat them; let us just look at the multicollinearity part of the table.

With the two independent variables and the dependent variable, we fit the regression and obtained the output. You might say the job is done, but look at the p-values: none of them is significant; all are greater than 0.05. This is the first indication of trouble: the R² may be high, look at the R², yet none of the p-values is significant, which signals that something is wrong. Now we check for multicollinearity by calculating the variance inflation factor. Before that, we tested the linear relationship between the two independent variables by computing their correlation coefficient, and we found it to be about minus 0.9. The correlation is negative, but the two independent variables are strongly correlated, at almost 90 percent.
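The lecture runs this example through Excel's regression tool. As a rough Python equivalent, here is a hypothetical sketch with synthetic data standing in for the lecture's data set; it shows the same diagnostic pattern: fit the two-predictor regression, look at the individual p-values, then check the correlation and VIF between the predictors.

```python
import numpy as np
import statsmodels.api as sm

# Synthetic stand-in for the lecture's data: x2 is constructed to be
# strongly negatively correlated with x1, as in the example.
rng = np.random.default_rng(0)
n = 20
x1 = rng.normal(10.0, 2.0, n)
x2 = -0.9 * x1 + rng.normal(0.0, 0.9, n)
y = 3.0 + 1.1 * x1 + rng.normal(0.0, 1.0, n)

# Multiple regression of y on both predictors.
X = sm.add_constant(np.column_stack([x1, x2]))
model = sm.OLS(y, X).fit()
# With correlated predictors the individual t-test p-values are
# inflated and may all come out above 0.05 even when R^2 is high.
print("R^2 =", round(model.rsquared, 3))
print("p-values:", np.round(model.pvalues, 3))

# Correlation between the predictors, and the VIF it implies.
r = np.corrcoef(x1, x2)[0, 1]
print(f"corr(x1, x2) = {r:.2f}, VIF = {1.0 / (1.0 - r**2):.2f}")
```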
Therefore the two independent variables in this particular problem are highly correlated: there is multicollinearity, because the correlation coefficient is quite high at minus 0.9, and if you calculate the variance inflation factor, VIF = 1 / (1 - r²), it comes out to 5.75, which is higher than 5. So moderate multicollinearity exists, and we have to deal with it; you cannot include both independent variables in the regression. You can verify this yourself: put the correlation (or the R² directly) into the formula and you will get the variance inflation factor.

Now that we have detected the multicollinearity, how do we remove it? We took x1 alone with y and predicted y through x1, so we now have a simple linear regression, and we did the same for x2, so we have both individual regressions. Before I conclude this example, note one observation. For x1 alone, the coefficient, the beta value, is positive: x1 is positively explaining y. But in the overall two-predictor regression shown on the previous slide, the same variable is negatively explaining y: the coefficient there is minus 0.29, while here it is plus 1.1. So the least-squares coefficients you obtain in the joint regression of x1 and x2 are completely absurd; you cannot conclude whether x1 is really explaining y or not. Also look at the p-values: in the regression of y on x1, the p-value of the intercept is quite high, 0.71; but in the regression of y on x2, the p-values are significant for both the intercept and the slope. So although in this example you could keep either x1 or x2, the recommendation is: since the p-values are significant for both the intercept and the slope in the x2 regression, include x2 in your final regression and exclude x1. The final recommendation is to take y and x2 and fit the causal relationship between them; x1 is removed. That is what multicollinearity is and how we handle it.

We even tested the two fitted models on this data, checking which one makes more accurate forecasts by feeding new x1 and x2 values into the individual regressions, and we found that x2 was consistently better in prediction accuracy. So we conclude that x2 is the better choice.

Let us see one more example, a practical one, to get better clarity about multicollinearity. Here the objective is to predict a person's height by means of their foot length.
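The "fit each predictor separately and compare" step can be sketched the same way. Continuing with the synthetic data from the previous sketch (again purely illustrative; in the lecture's data set the comparison favoured x2):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 20
x1 = rng.normal(10.0, 2.0, n)
x2 = -0.9 * x1 + rng.normal(0.0, 0.9, n)
y = 3.0 + 1.1 * x1 + rng.normal(0.0, 1.0, n)

def simple_fit(y, x, label):
    """Fit y on a single predictor and report the key diagnostics."""
    m = sm.OLS(y, sm.add_constant(x)).fit()
    print(f"{label}: R^2 = {m.rsquared:.3f}, "
          f"p(intercept) = {m.pvalues[0]:.3f}, p(slope) = {m.pvalues[1]:.3f}")
    return m

m1 = simple_fit(y, x1, "y ~ x1")
m2 = simple_fit(y, x2, "y ~ x2")
# Keep whichever predictor gives significant p-values and the better
# out-of-sample accuracy, and refit the final regression with it alone.
```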
As we all know, a person's foot length, whether of the left foot or the right foot, can give a tentative prediction of that person's height. So here we take height as the dependent variable, and the independent variables are the left foot length and the right foot length; suppose we have the data. We fit the regression of height on left foot length and right foot length using Excel and obtained the final regression line. The overall R² is quite good, and the overall ANOVA p-value from the F-test is also significant. But look at the individual p-values: those for the left foot and the right foot are both high, well above the 0.05 cutoff. Because the overall test looks good and the R² is high, you might be tempted to conclude that the regression is fine and carry on with forecasting; but when you come to the individual p-values from the t-tests, you find that none of the predictor (explanatory) variables is significant. So you cannot accept this regression as final; you have to check whether there is multicollinearity.

And indeed we found a strong correlation between the left foot data and the right foot data: the correlation coefficient we calculated is about 99 percent. So what do we do now? We find the variance inflation factor between the two predictors, by regressing one foot length on the other (equivalently, from their correlation coefficient). Plotting the data and doing the calculation, the R² between the two foot lengths is about 0.993, so the VIF comes out to 1 / (1 - 0.993) ≈ 143, far above 10. You cannot include both variables, left foot and right foot, together in a regression for height, so we will not use the two-predictor regression here; we have to exclude one of them.

First we took the left foot and predicted height from it. Look at the output: the p-value for the left foot coefficient is quite significant, the intercept is also significant, and the R² is high, so you can fit this regression line as a strong prediction; the left foot alone is sufficient. Then we considered the right foot, and its R² is also very strong and its p-values are also significant, so that regression is good as well. Now the question is which foot to use, left or right, in the final recommendation. In the previous example we kept only x2 as the final independent variable and deleted x1, and the reason was that the p-values for x1 were not significant. But here the p-values are significant for both independent variables.
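With more than two predictors, the R² in the VIF formula comes from regressing each predictor on all the others; statsmodels packages exactly this as variance_inflation_factor. Here is a sketch with made-up left/right foot lengths (the numbers are illustrative, not the lecture's actual data):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical foot lengths in cm: left and right nearly identical,
# mimicking the lecture's 0.99 correlation.
rng = np.random.default_rng(1)
left = rng.normal(25.0, 2.0, 30)
right = left + rng.normal(0.0, 0.2, 30)
height = 100.0 + 2.8 * left + rng.normal(0.0, 3.0, 30)

X = sm.add_constant(np.column_stack([left, right]))
# Column 0 is the constant; columns 1 and 2 are the predictors.
for idx, name in [(1, "left"), (2, "right")]:
    print(name, "VIF =", round(variance_inflation_factor(X, idx), 1))
# Both VIFs come out far above 10, so only one of the two foot
# lengths should enter the final regression for height.
```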
So which one should we use? Here the strategy is: either of them is fine. You can use the left foot to predict height in the regression, or, based on this data set, you can use the right foot. Both independent variables explain the dependent variable, height, effectively, and in this case the solution is to use either one. So you have to handle multicollinearity depending on the situation.

What, then, is the conclusion about multicollinearity in multiple regression? If there is a strong relationship, multicollinearity, between the independent variables, you cannot include them all; you have to remove one. To decide which one to remove, calculate the variance inflation factor, run the individual regressions, and check, through the p-values and the overall significance tests, which variable explains the dependent variable better, looking at the full regression summary: coefficients, standard errors, R², and p-values. Based on that, recommend which variable to include and which to exclude. And if you can add more sample data, you may be able to manage the multicollinearity without excluding any variable at all. These are a couple of ways to manage multicollinearity in multiple regression, but make sure you never ignore it: if multicollinearity exists among the independent variables, you have to handle it effectively. With that, I believe the basic understanding of multicollinearity and how to handle it in practical cases is clear to you all.