Hi, this is Dr. Justin Essary. This is week eight of PolySci 509, The Linear Model, and today we're going to continue talking about violations of the classical linear normal regression model. Last week we covered three common violations of the CLNRM assumptions and the consequences those violations can have for your regression analysis. Here you can see those three topics from last week.

The first was heteroscedasticity, which in plain language just means a non-constant error variance in the regression. This was a concern when the variance of the error u was correlated with one of the regressors x. When that happens, we have an efficiency problem: the standard errors of the estimated betas can be too big or too small, and we won't necessarily know which. Very often those standard errors are too small, meaning our estimates are overly confident and we reject the null hypothesis too often. We talked about a number of corrections for that problem, including various heteroscedasticity-consistent VCV matrices.

The second was omitted variable bias, which is exactly what it sounds like: the bias in the beta estimates that occurs when a model is mis-specified, specifically when a necessary regressor is left out. As we discussed, leaving out a regressor is only a problem if the omitted regressor is correlated with both the dependent variable and the independent variable you're truly interested in. Control variables are often omitted because we don't know they belong in the model or because we can't collect them; that's a problem if and only if the omitted control is correlated, and especially highly correlated, with the independent variable we actually care about.

Finally, we talked about multicollinearity, which in plain language is correlation among the x variables, correlation among the regressors. This creates a real problem when the regressors are so correlated with each other that we can't tell them apart, and in particular can't tell the difference in how they influence y. Multicollinearity obscures our ability to differentiate the causal or correlational relationship between each individual independent variable and the dependent variable. As we saw, it's really a problem when the correlations reach a very high level, say around 0.9 or 0.95; at lower levels it can be a bit of a problem, but not nearly as big of one.

This week we're going to continue with two more violations, measurement error and endogeneity, and we'll talk about what each of those means. Unfortunately, we're going to find that these violations are both more consequential in some ways and harder to fix than the three we examined last week, but I'll still have some suggestions to advance, some things you can do. So without further ado, let's get started. The first topic for today is measurement error.
I've written down here what may appear to be a somewhat silly question: what is measurement error? Well, it's the error you make when you measure something and measure it wrong. Let's get a little more technical about it, though. Presume the true data generating process looks something like this: y0 = x0*beta + u0. What we actually run is a model that looks like this but isn't quite it: y = x*beta + u. The important bit is that y is not y0; it's y0 + v. And x is not x0; it's x0 + w. Here v and w are measurement error terms. These measurement errors are typically presumed to be benign in the sense that they are, for example, independently and identically distributed with mean zero, meaning there is no systematic bias in measurement, just pure noise in data collection. We could incorporate systematic bias if we wanted to, but very often it's simply assumed away.

If we take these measurements, substitute, and solve, we get y0 + v = (x0 + w)*beta + u0, which rearranges to y0 = x0*beta + w*beta + u0 − v. Or, in terms of what we actually observe, y = x*beta + (u0 + v − w*beta). That term in parentheses, let me grab the red pen here, is a mishmash of measurement error components and the causal error term, and we can't pull the pieces apart. We don't observe w, so we can't determine the term w*beta. We can't determine u0 either; it's an unobserved error term we could only ever hope to estimate in a perfectly specified model. And v is a measurement error term we can't see. So all of this stuff ends up wrapped together in the estimated regression error.

As you might imagine, this causes a problem, because that composite error is correlated with things it shouldn't be correlated with. In particular, it contains w*beta, and the observed regressor x = x0 + w contains w too, so if w is non-zero the error is correlated with the regressor. And it contains v, which is part of the observed y, so components of the error are correlated with the dependent variable.
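To put the board algebra in one place, here is the substitution written out, rearranged so that the composite error is explicit:

$$ y_0 = x_0\beta + u_0, \qquad y = y_0 + v, \qquad x = x_0 + w $$

$$ y = x_0\beta + u_0 + v = (x - w)\beta + u_0 + v = x\beta + \underbrace{(u_0 + v - w\beta)}_{\text{composite error } u} $$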
From those two facts you can already guess that there are consequences to measurement error in the OLS context, some less pernicious than others. We'll start with the least pernicious, which is simple inefficiency. The composite error term u consists of the original error u0 from the correctly specified model, plus the measurement error in y, minus the w*beta term. Therefore the variance of that composite error, which we estimate as Var(u-hat) = u-hat'u-hat / (n − k), is going to be greater than the variance of the original error component u0. In other words, if we had no measurement error and estimated the model, the variance of our estimate of u0 would necessarily be less than the variance of this u-hat, which includes not just u0 but all the other terms wrapped up in it. When we make measurement errors, those errors get captured, absorbed, into the regression error term, so the regression error just gets bigger, and when we calculate the variance of that bigger error term we get a bigger variance. The practical consequence is that the standard errors of the beta coefficients are inflated, because, as you may recall, the variance-covariance matrix of beta is built from u-hat'u-hat through the estimate of the error variance. Bigger u-hats mean bigger standard errors on beta, which means it's harder to reject the null than it ought to be. Put another way, a model that excluded any measurement error would have fewer false negatives than this one, because it would have smaller standard errors on beta.

That's all bad, but comparatively not so bad, because certain types of measurement error can give us even bigger problems. In particular, we can get bias in the beta coefficients as a result of measurement error. How? You may remember that the unbiasedness proof of beta in the classical linear regression model relies on a particular assumption: the CLRM requires that the expectation of u given x equals zero. But the expectation of our composite error term given x is now the expectation of u0 + v − w*beta, given x0 + w, because x is x0 plus the measurement error w. We can break this into three parts: the expectation of u0 given x0 + w, the expectation of v given x0 + w, and the expectation of w*beta given x0 + w. We can probably say u0 is uncorrelated with x0 and with w, so the first part drops out, and the same goes for v: measurement error in y is presumably unrelated to x0 or to the measurement error in x. But that third term does not drop out. The expectation of w, given something that contains w, is not zero, so the w*beta term is non-zero, and therefore the expectation of u given x is not zero.

What this is telling us is that measurement error in x results in a bias problem, whereas measurement error in y results in efficiency problems only. If we make mistakes measuring the dependent variable, we can expect our standard errors to be too large, we'll fail to reject the null hypothesis too often, and we'll have too many false negatives.
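Written out, under the usual assumptions that u0, v, and w are mutually independent and mean zero, the key step is:

$$ E[u \mid x] = E[u_0 + v - w\beta \mid x_0 + w] = \underbrace{E[u_0 \mid x]}_{=0} + \underbrace{E[v \mid x]}_{=0} - \beta\,E[w \mid x_0 + w] \neq 0, $$

because observing x = x0 + w carries information about w. In the simple one-regressor case with classical measurement error, this works out to attenuation:

$$ \operatorname{plim}\hat\beta = \beta \cdot \frac{\sigma^2_{x_0}}{\sigma^2_{x_0} + \sigma^2_w}, $$

so the estimated slope is pulled toward zero.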
But if we mess up in measuring x, the regressors, then not only do we get efficiency problems, which are still wrapped up in there, but more to the point we get bias. We can get incorrect estimates of beta-hat: the estimated relationship between x and y can come out too big or too small. And I should say that measurement error in x typically biases beta downward. I could maybe think of special cases where it biases it upward, but the usual case is that it biases the estimate toward zero. In short, if you have measurement error in x, you have a really big problem to deal with. That is a very bad thing.

So now we could ask: how do we deal with that? We'll talk about that in a bit, but first I want to give you some applied examples in R showing how this works. I've given you some formal intuition about why measurement error causes a problem; now I want to give you some simulation evidence, some applied evidence, so to speak. I've prepared an R script here that you can follow along with. The first thing I'm going to do is remove everything from memory and set my seed to 123456, so we all get the same answer, and then generate a dataset of x and y. As you can see, x and y come out of a very simple classical linear normal model: there's a normally distributed error term and the DGP is 2 + 3x. If I run a regression on this dataset of size 100, you can see we more or less recover the DGP: we're supposed to get 2 + 3x, and we get about 1.8 + 3x. Great, close enough.

Now I'm going to add a normally distributed error term onto y and see what I get. Based on what I just showed you, I expect the standard errors of both coefficients to go up. I'm going to expand this bottom pane a bit so we can look at these together. When I run this regression, with a normally distributed error component on y only, what happened? The standard errors on both the intercept and x grew: the intercept's standard error went from 0.19 to 0.33, and the standard error on x went from about 0.3 to 0.5. These were already highly statistically significant quantities, so we still didn't get the false negative we were a little worried about, probably because the signal was so strong and the dataset was large enough for it not to matter. Nevertheless, you can see how this could cause a problem.

On the other hand, if I put a normally distributed error term on x, now we've got bigger problems. We're recovering roughly 1 + 2x instead of 2 + 3x. Again, this is the attenuation bias I alluded to earlier: typically measurement error results in attenuation bias. The betas are too small, which is bad news if you're trying to reject null hypotheses, and it's just generally bad news if you want to know what's happening in your data. As a practical matter, it can interfere with your ability to draw proper inferences and can result in more false negatives.
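For reference, here is a minimal sketch of the kind of script I'm running. The exact parameters, the range of x and the error standard deviations, are illustrative assumptions rather than the precise values in my file, so your numbers will differ a bit.

```r
# Measurement error demo: error in y inflates standard errors,
# error in x attenuates the slope (parameter values are assumptions)
rm(list = ls())
set.seed(123456)

n  <- 100
x0 <- runif(n, 0, 10)               # true regressor
y0 <- 2 + 3 * x0 + rnorm(n)         # true DGP: y = 2 + 3x + u

summary(lm(y0 ~ x0))                # baseline: roughly recovers 2 + 3x

y_obs <- y0 + rnorm(n, sd = 3)      # measurement error in y only
summary(lm(y_obs ~ x0))             # slope still about right, SEs larger

x_obs <- x0 + rnorm(n, sd = 3)      # measurement error in x only
summary(lm(y0 ~ x_obs))             # slope biased toward zero (attenuation)
```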
So now what I'm going to do is run a formalized simulation where I repeat this process many, many times. Specifically, I'm going to draw a thousand simulated datasets, and for each one I'm going to estimate three models: one that puts the error on x, one that puts the error on y, and one that puts error on neither element of the model. Then I'm going to summarize the confidence intervals. I'm only going to look at the confidence intervals on the x coefficient, and I'm going to capture two aspects of each one. First, whether it covers the true beta: I want to see whether the 95% confidence interval covers the true beta coefficient about 95% of the time. Second, how wide those confidence intervals are; with 95% coverage, narrower is better. So ci records whether the true beta is included in the confidence interval, and ciwid records the width of the confidence interval. I'm going to do this for all three cases, and then we'll look at what we get.
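Here's a sketch of what that simulation loop might look like. The structure follows what I've just described, a thousand replications, three models, and bias, coverage, and CI width on the x coefficient, but the specific true betas and error standard deviations are assumptions I'm filling in for illustration.

```r
# Coverage simulation sketch: no error vs. error in y vs. error in x
set.seed(123456)

nsim  <- 1000
n     <- 100
beta1 <- 1.877          # true intercept (illustrative values)
beta2 <- 1.36           # true slope

res <- as.data.frame(matrix(NA, nsim, 9))
names(res) <- c("bias", "ci", "ciwid",
                "bias.y", "ci.y", "ciwid.y",
                "bias.x", "ci.x", "ciwid.x")

grab <- function(m) {                      # bias, coverage, width for the slope
  est <- coef(m)[2]
  ci  <- confint(m)[2, ]
  c(beta2 - est, beta2 >= ci[1] & beta2 <= ci[2], diff(ci))
}

for (i in 1:nsim) {
  x <- runif(n, 0, 10)
  y <- beta1 + beta2 * x + rnorm(n)
  res[i, 1:3] <- grab(lm(y ~ x))                         # no measurement error
  res[i, 4:6] <- grab(lm(I(y + rnorm(n, sd = 3)) ~ x))   # error in y
  res[i, 7:9] <- grab(lm(y ~ I(x + rnorm(n, sd = 3))))   # error in x
}

round(colMeans(res), 4)   # mean bias, CI coverage rate, mean CI width
```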
You can see my simulation merrily running away here, and hopefully I'll get answers consistent with what I told you. Go, go, go... and we're done. OK. The bias, ci, and ciwid columns are for the model with no measurement error. Going back up to the simulation code, you can see the bias is computed as the true beta2 minus the estimated coefficient from the model of y on x, the model with no measurement error on either variable. Going back down to the results, the average bias is 0.0004, very tiny, essentially zero. The 95% confidence interval covers the true beta 94% of the time, so very close to the nominal level. And the mean width of the confidence interval is 0.137, in the units of the original beta, so 0.137 units of slope.

The y-error part of the simulation, which you can see at the top of the code, is the same thing, except that when we run the model we've added a normally distributed error component to y, with standard deviation three. For these y-error results we should still expect a bias near zero, and we get 0.001: slightly more bias, but probably not enough to worry about. The 95% CI covers the true beta 94.3% of the time, so that's good news, again not much different from the model with no measurement error at all. But the width of the confidence intervals is greater, 0.2471 as opposed to 0.1379. Our confidence intervals are consistently wider when we have error on the dependent variable. In other words, we have less confidence in our estimates; we can say less about the relationship between x and y if we measure y with error.

But if we measure x with error, and the x-error code is the same stuff except that we now put a normally distributed error term on x instead of y, we've got real issues. Coming down here, you can see the mean bias is 0.413. The bias is calculated as the true beta minus the estimate, so a positive bias means the estimate is coming out too small: on average we're underestimating beta by about 0.41. That's the attenuation bias I referred to, and it's bad. The 95% confidence intervals cover the true beta 1% of the time. That's terrible; the 95% CIs are not an accurate reflection of what's going on at all. And the confidence intervals are super wide, 0.3446, wider even than in the models that had the error on the dependent variable. So we're getting confidence intervals that are too wide, downward bias, and intervals that don't cover the true beta. That's the trifecta of not good. Measurement error on x is a really bad thing.

And a bigger N doesn't solve the problem. All the models you can see up here in the code are run on samples of size 100; x is drawn from 100 draws of a uniform distribution. Suppose we bump that up to 1000, so these are datasets of size 1000. If I rerun this exact code, changing nothing but the sample size, and even using the same seed so the random component of the simulation is comparable, here are the results. For the y-error and no-error models I'm getting basically the same answers: very little bias, narrow confidence intervals, and 95% coverage that really is 95% coverage. But I want you to focus on the x-error results, because that's where we had the biggest problems before. I've still got an underestimation bias of 0.4255. In other words, the true betas here are 1.877 and 1.36, and on average I'm underestimating the x beta, the second one, by 0.4255, a very substantial underestimate, so bad. My 95% CIs are not covering the true value at all. And my confidence intervals are more than twice as wide as the confidence intervals from the model with no error at all. So simply collecting tons and tons of data will not fix a measurement error problem, which is sad but important to know.

So now comes the inevitable question: what do we do about measurement error as a source of problems for our regression? Of course there's the simple answer: don't make measurement errors. Okay, fine: if I could have, I just wouldn't have made the errors. But very frequently we aren't in control of the data collection process to begin with. We may be using secondary data, someone else's data; you may be trying to replicate, or do an original analysis on, pre-collected data. For all these reasons and others, just saying "make fewer errors" is not really a viable option, although it's certainly worth thinking about whenever you're engaged in original data collection, and even then some level of error is probably inevitable. Given that "don't do it" is not the most helpful solution, we can instead try to mitigate the problem. And one of the biggest things we can do is try collecting multiple measurements.
So what do I mean by collecting multiple measurements? It's exactly what it sounds like. Each of the individual measures I'm thinking about may be flawed, but together we can get more out of them than out of any single flawed measure. Consider, for example, a simple mean of m measurements of the same concept. I'm going to treat them as being on the same scale for the moment; one could rescale or standardize them to force them onto the same scale, which is probably going to be necessary if you want to take a mean of these m measurements. By the law of large numbers, as m goes to infinity, and that's m, the number of measurements, not n, the size of the sample, the variance of the mean of the m measurements, one over m times the sum of the x's, goes to zero. So if I have m flawed measurements and I just take a simple average of them, then the more measurements I get, the smaller the variance of that average, and eventually it goes to zero. Furthermore, the probability limit of that average, as m goes to infinity, is the true value of the underlying concept. So the mean of m flawed measurements will approach the true value as the number of flawed measurements gets larger and larger.

What I'm essentially saying is that if you have crappy measurements, it's good to have a lot of them, because I can just average those crappy measurements, and assuming they're all unbiased crappy measurements, the only problem is noise in the measures. The more flawed measures I have, the better I can get at extracting the signal, the true value of the underlying concept, out of those flawed measures.
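Written out cleanly, the claim on the board is this: if each measurement is the true value plus independent, mean-zero noise, then

$$ x_i = x_0 + w_i, \quad w_i \ \text{iid}, \ E[w_i] = 0, \ \operatorname{Var}(w_i) = \sigma^2_w, $$

$$ \operatorname{Var}\!\left(\frac{1}{m}\sum_{i=1}^{m} x_i\right) = \frac{\sigma^2_w}{m} \longrightarrow 0, \qquad \operatorname{plim}_{m\to\infty}\ \frac{1}{m}\sum_{i=1}^{m} x_i = x_0 . $$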
You can see this in the simulation I prepared; let me give myself a little more room here so I can show you what I'm doing. I've got some concept x here, and I'm going to draw 100 observations of it. That's my true x value. Then I'm going to create five noisy measurements: x1, x2, x3, x4, and x5 are all noisy measurements of x, each equal to x plus a normally distributed error with standard deviation three. The true DGP, I should say, is y = 2 + 3x plus a normally distributed error term. If I use only one of these noisy measures of x to estimate the model, I get a biased estimate. I should be getting 2 + 3x. If I use only x1, I get about 2.75 + 2.5x, so I'm getting the attenuation bias. If I use x2, I get about 2 + 2.7x, a little closer to the true slope of three, but not really that great. If I use only x3, I get about 1.5 + 2.29x, again drastic attenuation. With x4, 2.16 + 2.39x. All of these individual models using the bad measures of x are roughly equally bad. They're each a little different in their badness, but they're all attenuated relative to the true relationship between x and y.

But suppose I bind these x measures together and calculate a mean of all the bad measures. What I'm doing is taking the five x measures and extracting a mean for each observation, and I'm calling that xx. So xx is the average of the five bad measures. If I run a model of y on that average, I get about 2 + 2.85x. That's pretty darn good. In fact, if I run a model with the original data, y on x with no measurement error whatsoever (sorry, I forgot to type the lm part; there we go), I get about 2.25 + 3x. So the averaged measure isn't doing much worse than I could do with the true data and no measurement error at all. And the more measures of x I had, the closer my averaged estimate would get to the true data generating process, or, more accurately, the closer it would get to the model I could estimate if I had the true underlying measures of x and y for any given sample size. The upshot of all this is: if you have to have bad measures, the best thing to have is a bunch of bad measures, because then at least you may be able to extract a common signal out of them by taking a simple mean.
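Here's a minimal sketch of that demo. The essentials match what I described, five measures, noise standard deviation of three, a DGP of 2 + 3x, but the rest of the setup is an assumption, so your coefficients won't match mine exactly.

```r
# Averaging several noisy measures of x recovers most of the lost signal
set.seed(123456)

n <- 100
x <- runif(n, 0, 10)                 # the true concept
y <- 2 + 3 * x + rnorm(n)            # true DGP: y = 2 + 3x + u

xm <- sapply(1:5, function(i) x + rnorm(n, sd = 3))   # five noisy measures of x

summary(lm(y ~ xm[, 1]))             # any single noisy measure: attenuated slope

xx <- rowMeans(xm)                   # simple mean of the five bad measures
summary(lm(y ~ xx))                  # much closer to the true slope of 3

summary(lm(y ~ x))                   # benchmark: the (unobservable) true x
```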
Another approach, which we're not going to cover in great depth in this class but which I want to genuflect toward because it should come up in your later training, is this: if you have multiple measures of x and they're all flawed, you could think about using factor analysis to extract the common principal component. Sometimes factor analysis is called principal components analysis; broadly speaking they're the same idea, though there are lots of sub-techniques, and sometimes principal components analysis is distinguished from old-style factor analysis. Either way, it's a way of extracting a common signal out of multiple different measures. In particular, if you have k measures, it will try to extract k different principal components, or signals, out of that combination of measures. And if there's really only one signal in those k measures, what you'll see is that the first principal component is dominant, it's the strongest signal you can extract, and all the others are basically just noise. Without going into great depth, because it's not really OLS or linear model material, it may be worth looking into factor analysis if you have many measurements and suspect they're all noisy measures of the same thing. Something to think about if you have an applied project.

Now I want to talk about the second problem I flagged for today, endogeneity. Endogeneity is pretty easy to explain and pretty hard to fix. Let's start with what it is. Up to this point we've essentially been assuming the following model: x causes y, and there's also this other component u that is involved in y, and those two things, x and u, combine to produce y. That's all hunky-dory; that's what an ordinary linear model looks like when these causal arrows are linear. But endogeneity says the world is more complicated than that, so let me redraw this.

We've still got x causing y, but now y causes x as well. For example, it could be the case that economic policy influences the rate of economic growth, but the state of economic growth also influences economic policy. If we want to figure out how policies affect growth, we want to focus on the part of the arrow going from x to y and ignore the part going from y to x. And this can get pretty confusing. If x causes y positively and y causes x negatively, you can get a canceling effect where it looks like there's no relationship even though there's a strong one. If they go in the same direction, x causes y positively and y causes x positively, you can end up believing that x causes y really strongly even though it causes it less strongly than it appears; there's just something else going on as well.

And of course it's common for there to be other things going on in such a model, so I'm going to go back to the red pen for a moment. I'm going to call this x1, and we might have something like x2 also causing y. We might have another variable out here, x4, causing x1. There might be some clean causal relationship between x3 and y over here. What we want to know is: how do we estimate a model that recovers the relationship between x1 and y when we suspect there's endogeneity going on, and when there are probably other things going on too, like needing to control for x2, and maybe including x3 because it's relevant, and so on? In fact, it's going to turn out that what we really want is this thing right here: this x4, which causes x1 but does not cause y. That's going to be of critical importance to us.

But before we get to that, we should ask: okay, this is what endogeneity is, but why do we care? What are the consequences of endogeneity? I alluded to them before, and the short answer is that yes, we care a lot, and we need to know what to do about it. We care about endogeneity because it causes bias in our estimates of beta: estimates of the relationship between x1 and y, where x1 and y are defined as in that causal diagram, will be biased in the presence of reverse causality between y and x1. Let me show you why this is true.

I'm going to start with a really simple model: y = x*beta + u0. Now suppose x, which is a single variable in this model, is itself a function of y; in particular, suppose x = alpha0 + alpha1*y + alpha2*z + v. This corresponds to the diagram where y causes x, x also causes y, and z causes x as well. If we substitute the x equation into the y equation and start distributing terms, we get y = alpha0*beta + alpha1*beta*y + alpha2*beta*z + v*beta + u0. What we've got here is a composite error term: v is the error in how x is related to y and z, it's unmeasured or unmeasurable in some way, and so v*beta plus u0 becomes our composite error term u. That means the regression y = x*beta + u involves a composite error term, and that error term is going to be correlated with something it shouldn't be.
Specifically, it's going to be correlated with the regressor x. The CLRM assumes that the expectation of u given x is zero; that's what we need for unbiasedness, and you can go back to that proof and see it. Well, in this case, the expectation of u given x is the expectation of u0 + v*beta given x. And what's x? x is alpha0 + alpha1*y + alpha2*z + v. The expectation of v*beta, given something that contains v, is not zero. So this conditional expectation is non-zero, which means that, by construction, this assumption of the CLRM is not true, which implies that the coefficients beta will be biased.

At this point, just to make a bit of a meta point, this is why we spent all that time going through those proofs at the beginning of the class. The proofs are necessary because they allow us to rigorously establish the conditions under which all the qualities we assign to OLS are true: when is beta unbiased, when is beta an accurate estimate of the underlying DGP, and so on. That lets us lay out the assumptions we need for those results to hold, which in turn allows us to figure out which violations of those assumptions, which problems we encounter, will make those results untrue. Very often, what we find is that some problem we encounter, like endogeneity or measurement error, creates correlation between the error component of the regression and the regressors x, which automatically implies, even under the best conditions of an otherwise properly specified DGP, that the coefficients beta will be biased. So in short, endogeneity is a very significant problem, because it's a bias problem, not merely an efficiency problem.
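Condensing the board algebra, the essential point is that the regressor contains v (and, through y, contains u0), so the composite error and the regressor are correlated:

$$ y = x\beta + u_0, \qquad x = \alpha_0 + \alpha_1 y + \alpha_2 z + v $$

$$ y = (\alpha_0 + \alpha_1 y + \alpha_2 z + v)\beta + u_0 = \alpha_0\beta + \alpha_1\beta\,y + \alpha_2\beta\,z + \underbrace{v\beta + u_0}_{\text{composite error } u} $$

$$ E[u \mid x] = E[u_0 \mid x] + \beta\,E[v \mid x] \neq 0, $$

since x is built partly out of v, so the CLRM assumption E[u | x] = 0 fails and beta-hat is biased.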
So what do you do if you have an endogeneity problem? There are several things one can do. I'm going to spend a little time on a quick and dirty fix that some people apply when they have time-series cross-sectional data, multiple units observed over multiple time periods, or even just a single time series, one unit observed over time. If at time t we have y_t = x_t*beta + u_t, but we suspect endogeneity between x and y, then instead of running that model we might run y_t = x_{t-1}*beta + u_t, using x at the previous time period. Why would we do that? Because if x_t could cause y_t and y_t could cause x_t, we might say: as long as x_{t-1} is closely related to x_t, I can use x_{t-1} as a proxy for x_t that is not itself correlated with y_t, because y_t causes x_t but it does not cause x_{t-1}; causality cannot flow backwards through time. So that backward link does not exist, simply because things can't go backwards in time. That's a quick and dirty fix that some people apply.

The trick is that it can actually be plausible for y_t to, quote unquote, cause x_{t-1}. How could that be? Obviously the link can't exist directly, in the sense that y_t cannot cause x_{t-1} given the direction of time. But if I, as an actor, anticipate that y is going to be at some level at time t, I might take preparatory measures in advance of that change in y. For example, if I think there's going to be a war next year, the dependent variable being war, I might build up my military stocks in the previous year to get ready for that war. That does not mean my increased military stocks caused the war; it means my anticipation of the war caused the increased military stocks. So in fact there can be links between y_t and x_{t-1}, and then this little correction won't work. It only works if we can strictly rule out any relationship between y_t and x_{t-1}, and that often fails in the presence of strategic anticipation. So this is a correction for endogeneity that is very often applied, and it's not a bad idea, but you need to think carefully about the conditions under which it can work; in particular, you need to rule out the possibility of strategic anticipation of future outcomes if you're going to apply it.

So that's something quick and dirty you can do, and under some conditions it does work. But what do we do if we don't want to use that quick and dirty fix, for example because it's a bad idea in our application? Well, you can use a procedure called two-stage least squares, or 2SLS, sometimes called instrumental variables regression. What is 2SLS? It's an approach to modeling that corrects for endogeneity using instrumental variables. An instrumental variable is a variable that enables us to break the link running from y to x while preserving the link running from x to y; in other words, it isolates one direction of the causal arrow. The problem in the previous proof is that x and u are correlated, and that causes bias in beta, so we need to stop that from happening. We must rid ourselves of the portion of x that is correlated with the error; that's the goal. Under the right circumstances we can do that, and those circumstances are the circumstances under which 2SLS will work.

So let's talk about what 2SLS is and how it works. Step one of 2SLS: predict x using a variable, or several variables, that are correlated with x but not with u, which is to say, not with y except through x. So find a variable that predicts x and doesn't predict y. That variable is called an instrumental variable. The stage-one regression is x = gamma0 + gamma1*z + v, which gives fitted values x-hat = gamma0-hat + gamma1-hat*z, where z is the instrumental variable. And I should say those first-stage estimates are fine even though, you may have noticed, there's an omitted variable here. The omitted variable is y: we know, because we're dealing with this problem, that y is a predictor of x, so you might think omitted variable bias would apply if we left y out. But that's only true if y is correlated with z.
As we already discussed, if the omitted y is not correlated with z, then omitting y is not harmful for our estimate of the relationship between x and z, and more to the point, our prediction x-hat is still a good predictor of x as long as there's no correlation between z and the omitted variable y. Omitted variable bias is not a problem in this case precisely because we've chosen z to be a predictor of x and not a predictor of y.

All right, step two: use x-hat in place of x in the model of y. We've got y = x*beta + u0, and x is just x-hat plus v-hat, so y = (x-hat + v-hat)*beta + u0 = x-hat*beta + v-hat*beta + u0. In other words, I've taken x and partitioned it into x-hat, the part of x we can predict using our stage-one model, and v-hat, everything left over. The v-hat*beta piece, which we effectively calculated from the previous regression, gets wrapped up into the error term, so we've got y = x-hat*beta + u, where u is a combination of v-hat*beta and u0. And x-hat and u are uncorrelated.

Now you might ask: how do I know that x-hat and u, which is a combination of v-hat*beta and u0, are uncorrelated? I know it because of a proof we did in one of our earlier classes. You may recall that in any regression, the fitted values, Py in our projection notation, and the residuals, My, are by construction uncorrelated. So the first component of u, the v-hat*beta piece, is by construction uncorrelated with x-hat; we've made it so by running the first-stage regression. And u0 is the original error of the model up here, and as long as we can assume it satisfies the normal CLRM assumptions, it too will be uncorrelated with x-hat. So I've got a model where u and the regressor are uncorrelated, and I can invoke the CLRM proofs as before. I'm home free. If I can implement this two-stage least squares procedure, I've broken the correlation between x and u, and I no longer need to worry about that correlation as a source of bias. That's good.
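Just to make the mechanics concrete, here's a hand-rolled sketch of the two stages using plain lm. The variable and dataset names are stand-ins, and in practice you'd use a canned 2SLS routine, because running the second stage by hand gives you the right point estimates but the wrong standard errors.

```r
# Two-stage least squares by hand (sketch; 'dat' is a hypothetical dataset
# containing y, the endogenous regressor x, and exogenous variables z, d, f)

# Stage 1: regress the endogenous x on the exogenous variables
stage1    <- lm(x ~ z + d + f, data = dat)
dat$x.hat <- fitted(stage1)          # the part of x predictable from exogenous stuff

# Stage 2: substitute x.hat for x in the outcome equation
stage2 <- lm(y ~ x.hat + z + d, data = dat)
summary(stage2)   # coefficient on x.hat is the 2SLS estimate of beta1
                  # (these printed standard errors are not the correct 2SLS SEs)
```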
Often, in fact I would say usually, 2SLS is implemented in the presence of a more complicated model than this simple x, y, z setup. We often have more than one predictor of y, more than one instrumental variable, and so on. So let me write a slightly more general version of what I've already written. Say y = beta0 + beta1*x + beta2*z + beta3*d + u, and x = alpha0 + alpha1*y + alpha2*z + alpha3*f + v. Here d and f are instruments for y and x, respectively, and z is a set of exogenous variables that influence both x and y. When instrumenting x, when running the first-stage regression, the common procedure is to use all the exogenous variables, which is to say z, d, and f, in that model: x-hat = gamma0-hat + gamma1-hat*z + gamma2-hat*d + gamma3-hat*f. I'm putting everything I think is exogenous in as a predictor of x. Why? Because I don't actually care much about the estimates of the gammas themselves; I'm not trying to get a properly specified model of x for its own sake. All I care about is getting a good prediction x-hat, and it turns out that as long as these variables are exogenous, they improve my estimate of x-hat without exacerbating the endogeneity problem, and thus the performance of my estimates improves.

So now what I want to do is show you an applied example of this in R and give you a sense of how it really works. I've got an R script here that creates an endogenous dataset. Constructing an endogenous dataset is actually a bit of an involved procedure. Oh, it seems we're going to need the sem package, the simultaneous equation modeling package, which I don't currently have installed on this machine, so you'll want to install the sem package with all its dependencies before we get started. That should only take a second. There we go; try it again; everything's great.

Creating a fake endogenous dataset is actually kind of hard, because endogeneity implies a system of simultaneous equations that has to be solved for every single observation in the presence of error. I don't want to belabor how I did it. All I'll say is that I've got a dataset of size 500, constructed so that the exogenous variable z predicts x with a beta of 1, x predicts y with a beta of 1, y predicts x with a beta of 2, the intercept in the y equation is 3, and the intercept in the x equation is 4. In other words, the data generating process is y = 3 + x and x = 4 + z + 2y.

So I've generated my data, and if I fit an ordinary model I get highly biased results. In particular, you can see the intercept in the y equation is way off, and the estimate of the relationship between x and y is off too: my estimate of the beta on x is 0.7. That's a bad estimate. On the other hand, if I come in and do a two-stage least squares model, here I'm explicitly calling tsls(y ~ x, ~z): the first part, y ~ x, says the model I'm interested in is the relationship between y and x, and the one-sided ~z after the comma says I want to use z as an instrumental variable for x. If I run this model, which is in the sem package, I get 3 + 1x, which is exactly what I should get given what I told you about the data generating process. That's good: the instrumental variable is appropriately recovering the underlying DGP.
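For reference, the calls look roughly like this; I'm omitting the code that builds the simulated endogenous dataset, so 'dat' here is a stand-in for it, and the exact layout is my reconstruction of what's on screen.

```r
# install.packages("sem")   # once, if the package isn't already installed
library(sem)

summary(lm(y ~ x, data = dat))                        # naive OLS: biased by endogeneity
summary(tsls(y ~ x, instruments = ~ z, data = dat))   # 2SLS: recovers roughly y = 3 + 1x

# With more exogenous variables in the system, both formulas grow, e.g.
# tsls(y ~ x + m + n + z, instruments = ~ k + l + m + n + z, data = dat2)
```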
Now suppose I have lots and lots of variables. What I'm going to do is create two instrumental variables for x, two instrumental variables for y, and a common exogenous variable z that predicts both x and y, so I've got a genuinely complicated model. Written out: y is a function of x, m, n, and z, and x is a function of y, k, l, and z. I'm going to run all this and create my dataset using somewhat arcane methods; we could discuss them at length privately if you like, but I'll spare everyone here, because it is genuinely involved to get this going. If I run a standard OLS model, just trying to predict y with everything I know, I get a biased estimate on x: as you can see, the coefficient on x is 0.87, and it should equal one, as in the previous model, because I've just added new stuff to the existing model.

The appropriate procedure, as I've described it, is to put everything, all the exogenous variables, in as instruments for x. If I instead try a two-stage least squares model using only k as an instrument for x, you can see that the routine chokes; it doesn't work at all, and it reports some kind of problem with the leading principal minors. The same goes for all these other models I've tried to run where I'm not including the full suite of instruments. But if I run a two-stage least squares model where I include all the exogenous variables, k, l, m, n, and z, well, geez, I get good answers. In fact I recover the coefficient on x, which we know is one in the true data generating process because I told you so, pretty well. So this shows you that you really do need to include all of the exogenous variables, the instruments for x, the instruments for y, and any common exogenous variables, in the first stage if you're going to implement the 2SLS procedure.

Now, you may be thinking to yourself at this point: boy, I'm really excited, I have a magic bullet that's going to solve all my endogeneity problems. Well, not so fast. 2SLS turns out to be a troublesome procedure in the sense that it's very finicky: a lot of things have to go right for it to work at all, and for it to work in a way that doesn't do violence to the underlying data generating process. So let's talk about some of the many practical difficulties of implementing 2SLS.

The first practical difficulty is simply finding an instrument. Remember that a good instrument has to be correlated with x and not with y. That turns out to be a very tall order in most applied datasets, where everything is plausibly related to everything else. It is so challenging, in fact, that debates have erupted in the substantive literature over the appropriateness of instruments used in 2SLS analyses. As an example, Acemoglu and Johnson, in the 2005 Journal of Political Economy, use colonial settler mortality rates as an instrument for property rights institutions when modeling their relationship to economic growth. They argue that settler mortality rates are correlated with the property rights institutions that were created, for historical and developmental reasons, but that those same settler mortality rates are not correlated with economic growth in the present day. Whether you think that's true or not, they make an argument for it. It's an unusual thing to do, in the sense that you're taking really old data, relating it to how institutions evolved over time, and saying: I think this relationship is strong enough that I can use this ancient piece of data as an instrumental variable, and it's ideal precisely because it's not related to anything going on in economic growth today. You get these kinds of unusual arguments being made precisely because it's so hard to find anything that's correlated with the regressor and not with the dependent variable.
So if you ever use this procedure, be prepared, first, to look in interesting places to find these instrumental variables, and second, to defend the operationalization you choose once you implement it. And just to make everything even more welcoming: there is no way to test whether the instrument really satisfies these conditions. It is impossible, at least as the procedure is currently understood, to implement a test of whether an instrument is correlated with x and not with y. You might think: sure, I'll just take my proposed instrument z and run a regression of z on x and y. Well, you could do that, but x and y are highly collinear, they're endogenously related, so that regression is not going to give you a good answer. x and y are related enough that we'd expect both of them to show up as decent predictors of z, even if the underlying relationship between y and z is actually zero and z doesn't actually predict y. And that goes double for the idea of regressing z on x alone and hoping the leftover relationship between z and y is zero: just because the estimated relationship between z and y is zero in that setup doesn't mean z isn't a predictor of y. Maybe x is just absorbing all the correlation, because x and z are supposed to be collinear; they're designed to be. So there's really no way to verify that you've got a good instrument.

And to cap it all off, there's the problem of weak instruments. A weak instrument is one where the correlation between z, the instrument, and x is small. The weaker the instrument, the greater the variation in the estimates, the greater the standard errors on the betas that come out of the second stage of the procedure. In other words, a weak instrument is going to give you highly variable results that might not be so great. Here's a case, for example, where I've created a dataset in which the first-stage relationship between z and x, z being the instrument and x the thing we're instrumenting, has a beta of only 0.06 instead of the 1 it used to be, so a tiny relationship. If I implement the two-stage least squares procedure, it runs, but my estimate on x is bad. The naive model tells me the coefficient on x is 0.59, and we know it's supposed to be 1, so there we're getting downward bias. But the endogeneity correction tells me the relationship between x and y is 3, so in this case it's far too big. Now, averaged over repeated samples these 2SLS estimates would center right around 1 for that beta between x and y, but in any particular sample the variation is going to be so great that you can get terrible answers. So if you've got a poor instrument, you've got a problem: yes, your estimates are technically fine on average, but they're not going to be especially useful. So there's that.
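A quick sketch of what that weak-instrument comparison looks like in code; the data construction is again omitted, so 'dat.weak' is a stand-in for a simulated dataset whose first-stage coefficient on z is tiny.

```r
# Weak instrument: the IV estimate is consistent but extremely noisy
summary(lm(x ~ z, data = dat.weak))                       # first stage: slope near 0.06
summary(lm(y ~ x, data = dat.weak))                       # naive OLS: biased by endogeneity
summary(tsls(y ~ x, instruments = ~ z, data = dat.weak))  # huge sampling variance
```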
Even worse, you have to have at least one variable that causes x and not y in order to make this work at all. Let me get my pen. An unidentified model is one in which there are fewer instruments than endogenous variables. You need at least as many instruments as you have endogenous variables if you want this procedure to work. If, for example, we suspect that w, x, and z are all endogenously related to y, we're going to need a separate instrument for each of those endogenous variables to make the 2SLS procedure work; if we don't, we have an unidentified model and, in short, we won't be able to derive meaningful beta estimates from it. A just-identified model has exactly one instrument per endogenous variable. That model will run, but as we already established in our little R script, more instruments are better than fewer in terms of getting accurate estimates of the second-stage betas, in other words, accurate estimates of the relationship between x and y. What we ideally want is what's labeled here an over-identified model, which has more than one instrument per endogenous variable. This over-identification, so to speak, improves the performance of the estimator by improving the quality of x-hat: we get a better estimate of x-hat, which in turn gives us a better estimate in the second-stage regression. But as I've already told you, it's really hard to find instruments, and even harder to find more than one, even though that's what you need in order to get really good estimates.

So what I'm telling you is that 2SLS is, in theory, a good, reliable procedure for handling endogeneity, but in practice it relies on your ability to do some things that are quite hard to do with an applied dataset. And if you go ahead anyway with one weak instrument for your endogenous variable, you can end up inflating the variance of your estimates so much that it might be better to just accept the endogeneity bias and live with it. In other words, the cure can actually be worse than the disease in some cases.

A couple of other things to think about. If the R-squared of the first-stage regression is high, then the 2SLS estimates will be close to the OLS estimates. That's because the x-hat values that come out of the first stage will be very close to the observed values of x; remember that R-squared is the proportion of variance in a variable that is reasonably attributable to, or explainable by, the model. If we're explaining and predicting almost all of x with our instrumenting model, then we're basically just putting x back in, a very slightly modified version of x. That's fine if v is small, that is, if there's not much of x left over to be correlated with the error in the naive OLS model; if you go back to the proof, the bias in beta-hat attributable to endogeneity works through that v component, so if v is small the endogeneity bias is small and there wasn't much of a problem to solve to begin with. But a high first-stage R-squared is very bad if it comes from an overfitted model, if x-hat is overfitted. Remember that the point of 2SLS is to get rid of the portion of x that is caused by y. Throwing a bunch of junk into the first-stage regression, just to predict x really well, defeats the purpose if it hands us back the very components of x that are correlated with the error to begin with. So including too many instruments is a problem too.
Because if you include too many of them, and they're not truly instruments in the sense of being genuinely correlated with x and not with y, you might end up "predicting" x well just because junk sometimes gets lucky, and then you get back the components of x that are affected by endogeneity, so your second-stage estimates are bad anyway. So I've just told you that you really want a good model of x at the first stage, but it needs to be genuinely good, and you can't assess its goodness just by looking at its R-squared, because a high R-squared could be diagnostic of overfitting, or it could be diagnostic of the fact that the endogeneity problem wasn't that bad to begin with. This is a pretty awkward situation: there are all sorts of things that can go wrong, our normal ways of diagnosing these problems don't really work, and you're trying to navigate between the Scylla and Charybdis of overfitting and under-specification with no guidance as to whether you've done it well, except for theoretical argumentation. It's a perfect storm of bad things.

And in case you needed any more evidence that 2SLS is a problematic procedure: it is only consistent. 2SLS is asymptotically valid, not valid in small samples, which means n minus k, where n is the number of observations and k is the number of variables, needs to be very large for the procedure to work well. If you try to implement 2SLS with a bunch of predictor variables in a small dataset of size 50 or 60, it may not work at all; it may all fall apart.

So: you need a large sample; ideally you need to find instruments, hopefully a lot of them, and they truly need to be good instruments; you've got to get the specification of the first stage right and the specification of the second stage right; and all of this comes on top of all the other problems we need to think about in an OLS framework, heteroscedasticity, multicollinearity, and the rest, which still apply. 2SLS is the methodological equivalent of fine china: beautiful, elegant, and not always the most practical thing. You wouldn't want to take fine china camping, and applied data analysis is often more analogous to camping than to an elegant dinner. But nevertheless, it's what we have, and as a wise man once said, you go to war with the statistical models you have, not the statistical models you wish you had. So 2SLS is one way of handling endogeneity, as far as it goes. Good luck. That's it for this week. I'll see you next week. Thanks a lot.