Instrumental variables are useful because they can be used to correct for endogeneity. But they are also useful because they can be used to test for endogeneity. Why would one want to test for the presence of endogeneity in the data? The reason is that when you use instrumental variable estimation, you typically lose a lot of efficiency, so the estimates become less precise than they would be without the instrumental variables. Also, some instrumental variable techniques, such as two-stage least squares, are biased in small samples and in the presence of weak instruments. For these reasons it is useful to test for endogeneity and then base your decision on whether to apply instrumental variables on that test. Of course, endogeneity should also be treated theoretically, so you should think through what the plausible reasons for endogeneity in your data are.

Articles that present guidelines on endogeneity analysis or instrumental variables typically list quite a few different techniques. These techniques are based on a few simple ideas. Where they differ beyond those ideas is in the assumptions they make: whether you can assume homoscedasticity or not, and whether you have independent observations or not. Let's take a look at the basic ideas behind these tests.

This is the simple instrumental variable model. The no-endogeneity assumption is that X is uncorrelated with uY, the error term of the main regression. If Z is a valid instrument, then the only reason why X might correlate with uY is a correlation between uX, the error term of the first-stage regression, and uY, the error term of the main regression from the second stage. Testing for endogeneity basically boils down to testing whether this correlation is non-zero. If the correlation between those two error terms is non-zero, then we have an endogeneity problem, assuming that the instrument is valid.

So how do we go about testing that assumption? The simplest approach is that if you use SEM, you get an estimate of this correlation after you have freed it, and you just look at the z-test for that correlation. If it is significant, you conclude that there is evidence of an actual endogeneity problem in the data, and using instrumental variables is warranted. If it is non-significant, you conclude that endogeneity is probably not a problem in your data, and you can probably forgo the instrumental variables.
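To make this concrete, here is a minimal Python simulation of the simple model (this sketch is mine, not part of the original material; the variable names z, x, y, u_x, u_y, the coefficients, and the error correlation of 0.5 are illustrative assumptions). It shows that when the two error terms are correlated, the OLS slope is biased while the IV estimate recovers the true coefficient:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Correlated structural errors: cov(u_x, u_y) != 0 is the endogeneity problem.
u_x, u_y = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], size=n).T

z = rng.normal(size=n)   # instrument: related to X, unrelated to u_y
x = 0.8 * z + u_x        # first-stage equation
y = 1.0 * x + u_y        # main equation, true beta = 1

# OLS slope is biased because cov(x, u_y) = cov(u_x, u_y) != 0.
beta_ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# The IV (Wald) estimator cov(z, y) / cov(z, x) stays consistent.
beta_iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

print(f"OLS: {beta_ols:.3f}  (biased away from 1.0)")
print(f"IV:  {beta_iv:.3f}  (close to 1.0)")
```

Setting the error correlation to zero in the simulation makes the two estimates agree up to sampling noise; the tests that follow all target this error correlation in one way or another.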
Then we have three other, somewhat more involved tests. The first is a nested model comparison: we compare two models, one being the instrumental variable model and the other a model where we assume that there is no endogeneity by constraining the error correlation to zero, and then we compare the models. The second is an augmented regression test, where we take the residuals from the first-stage regression and use them as a predictor of Y; if those residuals predict Y, we conclude that there is an endogeneity problem. The third is the general Hausman specification test, which compares an efficient estimator, in this case our normal regression analysis, against a consistent estimator, in this case the instrumental variable estimator, and checks whether they are sufficiently different that we can conclude that one of the estimators must be inconsistent. Let's take a look at what these tests do in a bit more detail.

The nested model comparison needs an unconstrained model. We then take something out of that unconstrained model to arrive at the constrained model. Here the unconstrained model has zero degrees of freedom, and taking out the error correlation gives us the constrained model with one degree of freedom; then we compare the models. Of course, since this constraint has one degree of freedom, we could just as well test it directly instead of using the saturated model as a baseline. But if we have more than one instrument, this nested model comparison is required: simply testing the constrained model's fit is not as good as doing the comparison, because the comparison is a more focused test.

Then we have the augmented regression test (a code sketch of this test, together with the Hausman test, follows at the end of this section). The idea is that we run the first-stage regression of two-stage least squares: we regress the endogenous variable X on the instrument. Normally in two-stage least squares we would then take the fitted values and regress Y on them; that is the second-stage regression. In the augmented regression test, we also include the residual, the difference between the observed X and the predicted X, in the second-stage regression. This residual is uncorrelated with the fitted X, so it does not really affect our estimate of beta, but the coefficient on the residual, whether the residual explains Y, is our test of endogeneity. If the residual explains Y, we conclude that we have an endogeneity problem, and then we need to use instrumental variables. If the coefficient on the residual is not significant, we conclude that there is not enough evidence of an endogeneity problem, and therefore we do not need instrumental variables.

The final test is the Hausman specification test. This is a general specification test, and I will explain it in more detail in another video, but the idea is that we have an efficient estimator and a consistent estimator. We know that OLS regression is efficient: it is the most precise way of estimating the relationship between X and Y, assuming there is no endogeneity. If there is endogeneity, OLS regression is going to be inconsistent. The instrumental variable estimator, on the other hand, is consistent under endogeneity, but it is inefficient. So if there is no endogeneity, the instrumental variable estimator and OLS regression are both consistent, but the instrumental variable estimator is a lot less precise. The idea of the Hausman test is that we compare how much the two estimators differ in our sample against the estimated difference of their variances. The intuition behind the test is that if the difference between the two estimators is large, we cannot attribute it to differences in efficiency; it must be because one of the estimators is inconsistent. In that case we conclude that OLS is the inconsistent one, because the instrumental variable estimator is consistent under more general assumptions. I will explain this test in more detail in another video, but it is a very general test; it sometimes requires large samples to work, but it is highly useful in different contexts.
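Here is the promised sketch of the last two tests in Python (again my own illustration, not from the lecture: the simulated data, the statsmodels usage, and the by-hand 2SLS are illustrative assumptions). Note that the 2SLS standard error below is the naive second-stage one, which is good enough for this illustration but not what a proper 2SLS routine would report:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
n = 5_000

# Simulate an endogenous X: the two error terms are correlated (0.5).
u_x, u_y = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], size=n).T
z = rng.normal(size=n)
x = 0.8 * z + u_x          # first-stage equation
y = 1.0 * x + u_y          # main equation, true beta = 1

# --- Augmented regression test ---
# First stage: regress X on the instrument and keep the residuals.
first = sm.OLS(x, sm.add_constant(z)).fit()

# Second stage augmented with the first-stage residual; a significant
# residual coefficient is evidence of endogeneity.
aug = sm.OLS(y, sm.add_constant(np.column_stack([x, first.resid]))).fit()
print("p-value of the residual coefficient:", aug.pvalues[2])

# --- Hausman specification test ---
ols = sm.OLS(y, sm.add_constant(x)).fit()                    # efficient under H0
tsls = sm.OLS(y, sm.add_constant(first.fittedvalues)).fit()  # 2SLS by hand (naive SEs)

b_diff = tsls.params[1] - ols.params[1]
v_diff = tsls.bse[1] ** 2 - ols.bse[1] ** 2   # Var(IV) - Var(OLS), should be > 0
H = b_diff ** 2 / v_diff                      # chi-square with 1 df under H0
print("Hausman statistic:", H, "p-value:", stats.chi2.sf(H, df=1))
```

With these illustrative numbers both tests should reject clearly; rerunning the simulation with the error correlation set to zero should make both of them non-significant.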
So those were the basic ideas of these tests, and these four basic ideas underlie all the tests listed, for example, here. Which test you pick then depends on what tests your statistical software offers and what estimation technique you apply for estimating the model. If you apply, for example, two-stage least squares, then you can't do a likelihood ratio test, because that requires maximum likelihood estimation, and so on. But basically, it then becomes a matter of preference: after you have checked which of these tests are applicable, you pick the one that you like the most.