multicollinearity and other commonly misunderstood features of regression analysis. Multicollinearity refers to a scenario where the independent variables are highly correlated. It is quite common to see studies run diagnostics to detect multicollinearity and then drop variables from the model based on some statistic indicating that multicollinearity could be a problem. There are quite a few problems with that approach.

Let's take a look at Hekman's paper. The authors identified that customer race and customer gender were highly correlated with physician race and physician gender, and therefore they decided to drop customer gender or customer race from the model, because these variables were correlated at more than 0.9 and that caused a multicollinearity situation. So what is this issue about, and why would one want to drop variables?

Multicollinearity relates to the sampling variance of the OLS estimates, or more generally of any estimator of a linear model. To understand multicollinearity, let's take a look at the variance of an OLS estimate, which is also the equation used for estimating the standard errors: Var(b_j) = σ² / (SST_j × (1 − R_j²)), where SST_j is the total variation in the focal independent variable x_j and R_j² is the R-squared from regressing x_j on all the other independent variables. What this equation tells us is that the variance of an estimate depends on how well the other independent variables explain the focal independent variable whose coefficient we are interested in. When R_j² goes up, 1 − R_j² approaches zero, and when you divide by something that approaches zero, the result is a large number. So when the focal variable becomes increasingly redundant in the model, providing the same information as the other variables, the variance of its regression coefficient increases. When the independent variables are more strongly correlated, the estimates are less efficient and less precise, and the standard error, which quantifies the precision of the estimates, is larger.

So is that a problem? Well, that depends. Let's take an example of what happens when we have two highly correlated variables and what it means for the regression results. We should expect the results to be very imprecise, so that if we repeated the study many times, the dispersion of the estimates over the repeated samples would be large. Here we have a correlation between x1 and x2 of 0.9, which is modeled on Hekman's paper, and we let the correlation between x1 and y vary between 0.43 and 0.52. This kind of dispersion could easily be the result of a small sample: if 0.475 is the population value and the sample size is, for example, 100, it is very easy to get a sample correlation of 0.43. The correlation between x2 and y is modeled the same way, and we have five combinations of correlations. The correlations with y vary only a little, but because x1 and x2 are highly correlated, the regression estimates calculated from these correlations vary widely.
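To make this concrete, here is a minimal sketch in Python. It is not from the slides; the correlation values are illustrative assumptions matching the example above, with 0.475/0.475 taken as the population values. Standardized regression coefficients can be solved directly from a correlation matrix as beta = Rxx⁻¹ rxy, and with r(x1, x2) = 0.9 the coefficients swing from (−0.2, 0.7) to (0.25, 0.25) to (0.7, −0.2) even though the correlations with y barely move.

```python
import numpy as np

# Minimal sketch: standardized OLS coefficients solved from a correlation
# matrix, beta = Rxx^{-1} rxy. The correlation values are illustrative
# assumptions matching the example discussed above, not the exact slides.
r12 = 0.9  # correlation between x1 and x2

# Five combinations where the correlations with y each move only a little
combos = [(0.43, 0.52), (0.45, 0.50), (0.475, 0.475), (0.50, 0.45), (0.52, 0.43)]

for r1y, r2y in combos:
    Rxx = np.array([[1.0, r12], [r12, 1.0]])
    rxy = np.array([r1y, r2y])
    beta = np.linalg.solve(Rxx, rxy)
    print(f"r(x1,y)={r1y:.3f}, r(x2,y)={r2y:.3f} -> "
          f"beta1={beta[0]: .2f}, beta2={beta[1]: .2f}")
```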
So in this model the regression coefficient for x1 is −0.2 and in this one it is +0.7; even the sign flips. The multicollinearity problem relates to the fact that because x1 and x2 are so highly correlated, it is very difficult to estimate the unique effect of x1, because changes in x1 are almost always accompanied by changes in x2, so we don't know which one is doing the work. Consider company size measured as revenue and as number of personnel. Those are highly correlated, maybe not at 0.9 but still highly correlated, so it is difficult to say by statistical means alone whether, for example, investment decisions depend more on the number of people or on the revenues of the company.

So what's the problem? The problem is that if we want to say that the effect beta1 is 0.25 and not 0, we have to be able to differentiate between these correlations. How large a sample would we need to say for sure that a correlation is 0.475 instead of 0.45 or 0.5? To answer that, we have to understand the sampling variation of a correlation. The standard deviation of a sample correlation around a population value of 0.475 with a sample size of 100 is about 0.05, so with a sample of 100 we can easily get something like 0.43 or 0.52, which are less than one standard deviation from that mean.

So when our sample size is 100 and x1 and x2 are correlated at 0.9, we really cannot say which of these sets of coefficients is the correct one, because our sample does not give us enough precision to say which of these correlations are the true population correlations that determine the population regression coefficients we are interested in. The fact that the two independent variables are highly correlated amplifies the effect of the sampling variation of these correlations: the sampling variation of the correlations with y is small, but because x1 and x2 are highly correlated, its effect on the regression coefficients is large. To be sure that model 3 is actually correct, so that a two-standard-deviation difference in the sample correlations would not be enough to get us from one of these models to another, we would need a sample size of about 3,000.

When independent variables are highly correlated, that is referred to as multicollinearity. It refers to correlations among the independent variables; it has nothing to do with the dependent variable. It increases the sample size required to estimate the effects precisely. This inflation of the variance of the estimates is quantified by the variance inflation factor. What the variance inflation factor quantifies is how much larger the variance of an estimate is compared to a hypothetical scenario where the focal variable is uncorrelated with every other independent variable. It is defined as 1 divided by (1 − R²), where R² is from regressing the focal variable on all the other independent variables — the same term that appears in the variance equation above. When 1 − R² approaches 0, the variance inflation factor goes to infinity; when it is exactly 1, the variance inflation factor is 1, which means multicollinearity is not present in the model at all. There is a rule of thumb that many people use: the variance inflation factor should not exceed 10; if it does, we have a problem, and if it does not, we don't.
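Here is a minimal simulation sketch of that sample size argument. The population values are assumptions implied by the example above (r(x1, x2) = 0.9 and r(x1, y) = r(x2, y) = 0.475, which gives standardized coefficients of 0.25 for both); the point is just to compare how much the estimate of beta1 bounces around at n = 100 versus n = 3,000.

```python
import numpy as np

# Simulation sketch with assumed population values implied by the example:
# x1 and x2 correlated at 0.9, both correlated with y at 0.475, so the
# standardized coefficients are beta1 = beta2 = 0.25. We draw repeated samples
# and compare the dispersion of the estimate of beta1 at n=100 and n=3000.
rng = np.random.default_rng(0)

pop_corr = np.array([
    [1.0,   0.9,   0.475],  # x1
    [0.9,   1.0,   0.475],  # x2
    [0.475, 0.475, 1.0  ],  # y
])

def beta1_estimates(n, reps=2000):
    estimates = []
    for _ in range(reps):
        data = rng.multivariate_normal(np.zeros(3), pop_corr, size=n)
        X = np.column_stack([np.ones(n), data[:, 0], data[:, 1]])
        y = data[:, 2]
        coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
        estimates.append(coefs[1])  # coefficient of x1
    return np.array(estimates)

for n in (100, 3000):
    b1 = beta1_estimates(n)
    print(f"n = {n}: mean beta1 = {b1.mean():.3f}, sd = {b1.std():.3f}")
```

With these assumed numbers, the spread at n = 100 is large enough that sign flips like the ones above are entirely plausible, while at n = 3,000 the estimates cluster much more tightly around 0.25.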
In the previous slide I showed that when two variables are correlated at 0.9, it becomes very hard to say which one produces the actual effect, because they co-vary so strongly. So what is the variance inflation factor when the correlation of x1 and x2 is 0.9? The R-squared is the square of the correlation, so 0.9 to the second power, or 0.81; we plug that in, do the math, and get a variance inflation factor of 1 / (1 − 0.81) ≈ 5.26. In the previous example we would have needed 3,000 observations to say for sure that model 3 was the correct model and not model 2 or model 4, yet the variance inflation factor would not have detected a multicollinearity issue. So what does that say about the rule of thumb? It is not a very useful rule.

Ketokivi and Guide make a good point about this rule, and about rules of thumb in general, in a Journal of Operations Management editorial. This is from 2015, when Ketokivi and Guide took over the Journal of Operations Management as editors-in-chief and first published an editorial on the methodological standards of the journal. They identified some problems and some places for improvement — what you should not do and what you should do — and they emphasize that you always have to contextualize your statistics. When you say that a regression coefficient is 0.2, whether that is a large effect or not depends on the scales of both variables and on the context. If you earn a thousand more per year for each additional year of education, that is a big effect for one person and a small effect for another, depending on where the person lives and how much the person makes. The interpretation of all of these statistics requires context.

They take aim at the variance inflation factor as well. The variance inflation factor quantifies how much larger the variance is compared to a scenario where there is no multicollinearity whatsoever between the independent variables. Their point is that if the standard errors from your analysis are small, who cares that they could be even smaller in a hypothetical scenario where the independent variables are completely uncorrelated, which is unrealistic anyway? If the standard errors indicate that the estimates are precise, then they are precise, and that is what we care about. So the variance inflation factor does not really tell us anything useful. On the other hand, they also note that in some scenarios the rule of thumb that the variance inflation factor must not exceed 10 is not strict enough. In the previous example we saw a 0.9 correlation, corresponding to a variance inflation factor of about 5.26, that made it a lot more difficult to identify which of the models was correct — a collinearity issue that the variance inflation factor did not detect. As Ketokivi and Guide say, stating that the variance inflation factor must not exceed a cutoff, without considering the context, is nonsense. I agree with that statement fully: you always have to contextualize what a statistic means in your particular study.

Wooldridge also takes some shots at the variance inflation factor and multicollinearity. This is from the introduction to the fourth edition of his book; he did not address multicollinearity in the first three editions because he does not consider it a useful or important enough concept.
Regression analysis does not make any assumptions about multicollinearity. It assumes that each independent variable contributes unique information — the variables cannot be perfectly correlated — but it makes no assumptions beyond that. Wooldridge decided to take up the issue because there is so much bad advice about multicollinearity. He says that the usual explanations of multicollinearity are typically wrongheaded: people explain that it is a problem and that if the variance inflation factor is more than 10 you have to drop variables, without really explaining what the problem is or what the consequence of dropping variables from your model is.

So let's take a look at what it means to solve a multicollinearity problem. Multicollinearity is a problem in the same sense that a fever is a disease: it is not really a problem per se, it is a symptom, and you don't treat the symptom, you treat the disease. If you have a child with a fever, cooling the child down by putting them outside in the cold is typically not the right treatment. You have to look at what is causing the fever and fix the cause instead of trying to fix the symptom.

The typical solution to a multicollinearity problem — how do we make x1 and x2 less correlated? — is to just drop one of them from the model. Say we drop x2. In the previous example the correct model had both effects at 0.25, and if we drop x2, the estimate for x1 will reflect the influence of both x1 and x2. We will overestimate the regression coefficient beta1 by 90%, and the standard error will be smaller, so we will have a false sense of accuracy about a severely biased estimate. On the other hand, if control variables are collinear with one another, that is irrelevant, because typically we just want to know how much of the variation of the dependent variable is explained jointly by the controls, and we are not really interested in which of the controls explains it. Collinearity between the variables of interest and the controls matters, but collinearity among the controls alone does not.

So treating collinearity as a problem is the same as treating a fever as a disease: it is not a smart thing to do. We have to understand the reasons why two variables are so highly correlated that we cannot really say which one is the cause of the dependent variable. There are a couple of reasons why that can happen. The first is that you have mindlessly added a lot of variables into the model, and you should not do that: every variable that goes into your model must be based on theory. Just throwing 100 variables into a model typically does not make sense. Your models are built to test theory, so they must be driven by theory: whatever you think has a causal effect on the y variable goes into the model, and you must be able to explain why, and through what mechanism, each independent variable causally influences the dependent variable. So that is the first reason: you have just been mindlessly data mining, and that is the problem — multicollinearity is not the problem here; the problem is that you are making poor modeling decisions.
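To see where that 90% figure comes from, here is a short sketch using the same assumed population values as before. For standardized variables with one omitted regressor, the short-regression coefficient equals beta1 + beta2 × r(x1, x2), so dropping x2 inflates the estimate for x1 from 0.25 to 0.475.

```python
# Sketch of what dropping x2 does, using the assumed population values from
# the earlier example: r(x1, x2) = 0.9 and beta1 = beta2 = 0.25 in the
# correct (full) model, with all variables standardized.
r12, b1, b2 = 0.9, 0.25, 0.25

# Omitted variable bias with one omitted standardized regressor: the
# short-regression coefficient of x1 picks up beta2 * r(x1, x2) on top of
# its own effect, which here equals the simple correlation r(x1, y) = 0.475.
b1_short = b1 + b2 * r12

print(f"beta1 in the full model: {b1:.3f}")
print(f"beta1 with x2 dropped:   {b1_short:.3f}")
print(f"overestimation:          {100 * (b1_short - b1) / b1:.0f}%")  # 90%
```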
The second reason is that you have distinct constructs but their measures are highly correlated. Here the primary problem is not multicollinearity but discriminant validity: if two measures of things that are supposed to be distinct are highly correlated, that is a problem of measurement validity. I will address that in a later video.

The third reason is that you have two measures of the same construct in the model. For example, if you are studying the effect of company size, you might have both revenue and personnel in the model as measures of firm size. It is not a good idea to have two measures of the same thing in the model. Let's take an extreme example: suppose we want to study the effect of a person's height on their weight and we have two measures of height, centimeters and inches. It does not make any sense to try to estimate the effect of inches independent of the effect of centimeters; in fact, that cannot even be estimated. If you have multiple measures of the same thing, you should typically first combine them into a single composite measure, as in the sketch at the end of this section. I will cover that in more detail later on.

The final case is that you are genuinely interested in two closely related constructs and their distinct effects. For example, you want to know whether a person's age or a person's tenure influences the customer satisfaction scores that patients give to their doctors, as in the Hekman study. Then you really cannot drop either one. You cannot say that, because tenure and age are highly correlated, we are just going to omit tenure and assume that all the correlation between age and customer satisfaction is due to age alone and that tenure has no effect. That is not the right choice. Instead, you have to increase the sample size so that you can answer your complicated research question with adequate precision.
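For the case of two measures of the same construct, here is a minimal sketch of building a unit-weighted composite. The data and variable names are hypothetical, and a factor score or other weighting scheme would also work; the point is simply that each measure is standardized first so that neither dominates because of its scale.

```python
import pandas as pd

# Hypothetical firm data: two measures of the same construct (firm size).
df = pd.DataFrame({
    "revenue":   [1.2e6, 3.4e6, 0.8e6, 5.1e6, 2.2e6],
    "personnel": [15, 40, 9, 66, 30],
})

# Standardize each measure so neither dominates because of its scale,
# then average into a single composite that enters the regression
# instead of the two separate, highly correlated measures.
measures = df[["revenue", "personnel"]]
z = (measures - measures.mean()) / measures.std(ddof=0)
df["size_composite"] = z.mean(axis=1)
print(df)
```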