In this video I will explain the instrumental variable solution to the endogeneity problem. To understand the instrumental variable solution, we first need to understand the endogeneity problem. I explained the problem in more detail in another video, so this is just a quick recap. Endogeneity occurs when we have a regression model such as the one shown graphically here as a path diagram. The error term U represents any other causes of Y that are not included as explanatory variables. If any of those other causes of Y is correlated with any of the included variables X, then we have an endogeneity problem. For example, if we are trying to explain company performance, let's say ROA here, with whether a company invests in a new manufacturing plant or not, then both investment and profitability probably depend on company strategy. In that case strategy is an omitted variable that is correlated with X1, and that leads to an endogeneity problem.

More generally, if we look at X and Y in just the bivariate case in more detail, the correlation between X and Y is this direct path here plus this correlation times 1, because the path for the error is constrained to be 1. So the correlation between X and Y is the direct regression path plus the spurious correlation that arises because X correlates with the omitted causes U. The problem is that we observe just this one correlation, so we have one unit of information from the data, and we want to estimate two different parameters. This is an under-identified model. The degrees of freedom are minus 1, which means that the model can't be meaningfully estimated. We can't estimate two different things from one thing.

To solve this problem we can apply instrumental variables. The idea of an instrumental variable is that we get a third variable Z that is correlated with X, which is a correlation that we can test empirically, and that we can assume is uncorrelated with U, the other causes of Y.
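The endogeneity recap above can be illustrated with a small simulation. This is only a sketch under assumed numbers: the true effect of 0.5, the 0.8 loading of X on the omitted causes, the sample size, and all variable names are made up for illustration and do not come from any study mentioned in the video.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
beta = 0.5  # assumed true effect of x on y (illustrative value)

u = rng.normal(size=n)            # omitted causes of y, e.g. company strategy
x = 0.8 * u + rng.normal(size=n)  # x depends on u, so x is endogenous
y = beta * x + u                  # the path from u to y is fixed at 1

# In the bivariate case the OLS slope is cov(x, y) / var(x).
ols_slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

print(f"true beta: {beta}, OLS slope: {ols_slope:.3f}")
```

In this setup the regression slope absorbs the spurious part that comes through U, so it lands well above the assumed true effect, and the data alone cannot tell the two parts apart.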
How we find variables that qualify as instruments is a difficult problem, because we generally cannot test the correlation between Z and U empirically. We have to argue for it based on theory. I'll show you an example soon, but let's take a look at the principle first. Let's assume we have a valid instrumental variable, so that the only reason why Z and Y are correlated is that Z is correlated with X. That means Z can't be correlated with U. The correlation between Z and Y then comes from the path analysis tracing rule: we take the correlation between X and Z and then the direct path from X to Y. So the correlation between Z and Y is beta times the correlation between X and Z. From here we can solve for beta by dividing the ZY correlation by the XZ correlation. Both are observable quantities, and that gives us a consistent estimate of beta. That's a way to estimate beta.

A variable Z qualifies as an instrumental variable if it meets two criteria. First, it must have relevance for X, so X and Z must be correlated. That can be checked empirically: we just calculate the correlation and do a statistical test for it. Then there is the exclusion criterion, which has to be argued based on theory. Because we don't observe U, we can't test whether Z and U are uncorrelated, and making that argument is difficult to do.

Let's take a look at an example. In the paper by Mochon et al., they apply instrumental variables. To understand the instrumental variable used here, we first have to understand the endogeneity problem that they are dealing with. So why do they use instrumental variables? Their dependent variable was point acquisition, so people are acquiring points in a service, and they are testing whether the decision to like the Facebook page of that service leads to more point acquisition. And they did an experiment, so they have this randomization step here.
So they invited some people to like the page that they were studying, and the rest were the control group. This is randomization, and it is exogenous because there is no reasonable way that a random number generator on a computer would be correlated with the behavior of actual people. It's implausible to claim that this would not be exogenous. So randomization is exogenous.

Then we have endogenous selection. The reason why this selection is endogenous is that when you're invited to like the Facebook page of a service, whether you accept the invitation or not probably depends on how much you like the service, how much you use the service, and so on. So there are probably multiple different causes that influence whether you choose to accept the invitation and that also influence how active you are in acquiring points in the service. Comparing those who chose not to like the page against those who did like it is therefore not a valid comparison, because these two groups of people are not comparable. That is, we have endogenous selection here.

So we have basically a few options. We can compare between treatment and control here. But that doesn't really give us the effect of a like, because some of the people in the treatment group chose not to like the Facebook page. Also, some people in the control group could have liked the page anyway. So comparing treatment and control on point acquisition doesn't really allow us to do what we want to do. We can't compare between those who chose to like and those who chose not to like, because this is an endogenous selection. And we can't compare those who chose to like against the control group, because the control group contains people who would have chosen not to like had they been asked. So these two are not comparable either.

What we can do here, and what Mochon et al. do, is apply the instrumental variable technique. The idea is that the treatment, the randomization here, is correlated with choosing to like.
If you ask some people to like a Facebook page and you don't ask the other group, then the people you ask are more likely to actually like the page. This can be established empirically. They can calculate this correlation here and establish that the treatment is a relevant instrumental variable for choosing to like, so it fulfills the relevance criterion. The treatment also fulfills the exclusion criterion because the treatment is randomized. It is very unlikely that this treatment correlates with any other reason that an individual person would have to like the page. When a random number on our computer assigns people to treatment or control, the assignment is independent of any attribute of the people that we randomized. So it fulfills the exclusion criterion. Then they can apply these equations to calculate the effect of one Facebook like.

In practice we don't work with these equations, because we usually have multiple different variables. We have controls, and we can have multiple instrumental variables as well. So we use some other technique, and one of the simplest techniques is called two-stage least squares. The idea of two-stage least squares is that when we take the instrumental variable Z, instead of just saying that X and Z are correlated, we regress X on Z and then calculate things based on this regression.

Let's see how it works. First we have the endogenous regression analysis. If we regress Y on X, we have an endogeneity problem because some causes of X are correlated with some causes of Y. Then we have the instrumental variable Z here. We say that X is actually a sum of Z multiplied by beta 2 plus the error term from that regression analysis. So we have the first regression of X on Z here, and plugging it into the equation for Y gives the second regression. Then we can multiply this out, so we have beta 1 times beta 2 times Z. That is the effect of Z on Y that runs through X.
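The substitution just described can be written out as follows. This is a sketch in my own notation: beta 1 and beta 2 are the coefficients named in the video, while the error labels u and e are assumed names for the two error terms.

```latex
\begin{align}
Y &= \beta_1 X + u                 && \text{(endogenous regression of } Y \text{ on } X\text{)} \\
X &= \beta_2 Z + e                 && \text{(first stage: regression of } X \text{ on } Z\text{)} \\
Y &= \beta_1(\beta_2 Z + e) + u
   = \beta_1\beta_2 Z + (\beta_1 e + u) && \text{(substitute and multiply out)}
\end{align}
```

Because Z is assumed to be uncorrelated with both e and u, regressing Y on Z recovers the product of beta 1 and beta 2, and dividing that product by the first-stage coefficient beta 2 leaves beta 1.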
And this is typically implemented by running two sets of regressions. This beta 2 times Z is the fitted value from a regression analysis of X on Z. So in practice we implement this model by first regressing X on Z and taking the fitted values of X, and then we regress Y on those fitted values of X from the first regression. So we run the first regression to get fitted values, then we run the second regression on the fitted values, and that gives us consistent estimates of this relationship.

If we have more than one independent variable, say five independent variables, then we regress each one of those five independent variables on the instruments separately. If we have variables that are not endogenous, then they qualify as instruments as well. We take the fitted values from each of those five regression analyses and use those fitted values to explain Y. That will produce consistent estimates of the betas under the assumption that Z is relevant and does not correlate with the omitted causes of Y.
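The two-step procedure described above can be sketched with plain NumPy. This is a minimal illustration on simulated data, not the analysis from the paper discussed in the video: the function names `ols` and `two_stage_least_squares`, the true effect of 0.5, and the data-generating coefficients are all assumptions chosen for the example, and it handles only one endogenous variable with one instrument.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients from regressing y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def two_stage_least_squares(y, x, z):
    """2SLS slope for one endogenous regressor x and one instrument z."""
    const = np.ones(len(y))
    # Stage 1: regress x on the instrument (with intercept), keep fitted values.
    Z = np.column_stack([const, z])
    x_hat = Z @ ols(Z, x)
    # Stage 2: regress y on the fitted values of x from stage 1.
    X_hat = np.column_stack([const, x_hat])
    return ols(X_hat, y)[1]

# Simulated data under assumed values (illustrative only).
rng = np.random.default_rng(1)
n = 200_000
beta = 0.5                                   # assumed true effect of x on y
u = rng.normal(size=n)                       # omitted causes of y
z = rng.normal(size=n)                       # instrument, independent of u
x = 0.6 * z + 0.8 * u + rng.normal(size=n)   # x is relevant to z and endogenous
y = beta * x + u

naive = ols(np.column_stack([np.ones(n), x]), y)[1]  # plain regression of y on x
iv = two_stage_least_squares(y, x, z)                # two-stage estimate
ratio = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]      # simple ratio form

print(f"naive OLS: {naive:.3f}, 2SLS: {iv:.3f}, ratio: {ratio:.3f}")
```

With a single instrument, the two-stage estimate coincides with the ratio of the ZY and ZX covariances, which matches the correlation-based solution for beta. Note that the standard errors reported by a manually run second-stage regression are not correct for 2SLS, so for real analyses a dedicated instrumental variable routine is the safer choice.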