Hi, this is Dr. Justin History, and this is week seven of PolySci 509, the linear model. Today we're going to talk about what happens when things in your regression start to go wrong. For the last few weeks we've been building up the ordinary least squares regression model under a set of assumptions where, basically, either we know what's going on or we're satisfied that our model is a good enough approximation of the underlying data generating process to be useful. Today we're going to talk about what you do when you know, or discover, that certain things about your regression are not an accurate reflection of the underlying data generating process, and what you can do to recover OLS as a useful tool in those situations. Let me start by recapping what we've done over the last couple of weeks. We talked a lot about the classical linear normal regression model, or CLNRM. The CLNRM is a collection of assumptions about the data generating process that lets us derive results about the performance and usefulness of ordinary least squares regression. But I'm sure you were thinking, as I presented them and as you read about them, that these assumptions are rarely met in practice, at least not perfectly. The CLNRM assumptions are not chosen because they are mirror-like reflections of the world, but because they are useful simplifications that let us derive interesting implications from the OLS model. Now we want to talk about what happens when we discover violations of those assumptions, and what we can do about them. We've actually touched on this already. We talked about what happens when the N part of the CLNRM is violated. You may remember that the N stands for normal, or normality, and what it refers to is the normality of u, the error term. A couple of weeks ago we asked: what happens when the errors are not normal? The answer is, it depends, as the answer in statistics usually is, as in most things in life, I suppose. If you have a really large sample, the implications of violating N are very small, because the central limit theorem, as we learned, ensures that the beta coefficients in an OLS model are distributed normally regardless of the underlying distribution of the error term. So the same results and test statistics fall out regardless of the distribution of the errors. But in small samples, we can have a significant problem, because the sampling distributions of the t and F statistics are derived from the assumption that u is normally distributed. If u is not normally distributed, then all of those proofs about t statistics and F statistics go out the window, and we have no idea how the t and F statistics are distributed, which means we have no way of constructing test statistics, which means we have no way of constructing confidence intervals or conducting hypothesis tests. So in a small sample, if you don't have normal errors and you say something like, "the p-value for this t-test is 0.02, which means that under the null hypothesis we would see a result of this magnitude 2% of the time," that's just not true unless the errors are normally distributed.
In that case, we sort of weaseled out of the problem by saying, well, we're fine if we're in a large sample, where "large" is hard to define, but let's say greater than 60 or 70 observations; maybe 100 observations is safe. Then this doesn't matter. And what we're going to do today is basically repeat that basic pattern for other problems. We're going to say: here's a problem, here are some consequences of that problem, and here's a way we can maybe weasel out of it. Some problems are easier to weasel out of than others, as you'll see. Some of the solutions are more technical, some less. By the end of this lecture, you should have a good grasp of the problems you may encounter as you do regression analysis and what you can do about them. In particular, today we're going to look at three violations of the CLNRM. The first is called multicollinearity. One thing about getting a PhD is that you never use one word when five words will do, or you can use one really long word. Multicollinearity is just a PhD-level word for regressors that are correlated with each other: your regressors x are correlated with one another. The problem with multicollinearity is that it violates the assumption of a non-stochastic X. If you recall, one of the assumptions we made when we were doing our proofs about OLS was that X was non-stochastic. If the regressors are correlated with each other, then first of all they must be random, so they're stochastic; and second, although we did a proof where we relaxed the assumption of non-stochasticity, we still assumed in that proof that the x's were not correlated with each other. Now we're assuming that they are. This problem is probably at once the most common and, in some ways, the hardest to fix, because its consequences can be severe, though only in the worst cases of multicollinearity. And unfortunately, one of the only fixes for multicollinearity is to get more data, or maybe rethink the way you're conceptualizing your project. We'll talk about that in more depth in just a minute. The second problem is omitted variable bias. This is a version of a mis-specification problem: your model is mis-specified. In particular, it's a problem where one or more elements of the regressor matrix X are missing. The basic idea is that you've got some model of a dependent variable y, you're predicting it with a bunch of x variables, and there's other stuff that belongs in that X matrix that you're leaving out. Obviously, very few people do this intentionally. We may omit things because we simply don't know they're related to y. We may omit them because they're very hard to measure or collect. We may omit them because we think our measurements are so bad that including them would make things worse. But no matter why we do it, the effect is the same: we violate the assumption of correct specification. In all our proofs from a few weeks ago, we assumed, as part of that, a correct specification; we assumed that the model was correctly specified. That was very important at many steps in our proofs, particularly when we were substituting things in and out. If we have omitted variable bias, you can't assume that anymore, and we'll talk about that in a second. And then finally, heteroscedasticity.
This is, again, one of those PhD words for something that could probably be explained more simply. It's just a case where the error terms u do not have constant variance sigma squared. That assumption often came up when we were talking about the properties of test statistics. We often said we're going to assume that u is independently and identically distributed with common variance sigma squared, which eventually enabled us to say that the VCV of beta is (X transpose X) inverse times sigma hat squared, which is (X transpose X) inverse times 1 over n minus k times u hat transpose u hat. If we violate this assumption, it means the error terms don't have constant variance, and thus the proofs that led up to this result no longer apply. That's going to cause significant problems when we go to construct test statistics. All the test statistics we constructed are going to be, in some way, inappropriate, and thus our hypothesis tests are going to be compromised. They could be underconfident or overconfident; that is, the confidence intervals could be either too wide or too narrow. Heteroscedasticity is, I think it's fair to say, probably the problem in OLS over which the most ink has been spilled. That's partly because it's the problem with the most methodologically oriented solutions. There are lots of things one can do to fix heteroscedasticity methodologically, so this is probably also going to be the most technical part of today's talk. So what we're going to do is go through these problems in greater depth. We're going to ask: what happens when these assumptions are violated? And then, if they are violated, what can we do to fix the problems created? Like I said, sometimes there's a really technical thing we can do; other times, we just have to live with it, or maybe try to get out of it by rethinking the research design or collecting more data. All right, so let's talk a little bit about multicollinearity. Like I just said, we're going to first talk about what happens when the independent variables in a regression are multicollinear. The short answer is that when you have two predictor variables, x1 and x2, that are closely related to one another, it's really hard to figure out how each separately contributes to the prediction of y. I've got a little diagram here that illustrates this. I've got two x variables, x1 and x2, and they're highly correlated with one another, and they're also correlated with y. Actually, I can go into R and reproduce this for you. So first, here on the screen are x1 and x2, the predictor variables, and you can see they're generated to be very closely correlated with one another. And if I go into my RGL object, I can rotate this cube nicely, and you can see that we can't really figure out how x1 or x2 separately causes y; we just know that they both cause y. There's almost a straight line being drawn through this space. What this means in terms of regression analysis is that the estimated standard errors on beta will be larger than normal, but beta will be unbiased. The estimated SEs for beta hat will be large, but beta hat itself is still unbiased. That's the short answer.
And the severity of that problem depends on how severe the multicollinearity is in the particular sample. So here I've got a little illustration where I've reduced the level of multicollinearity between x1 and x2, and you can see that now there's much more of a planar surface that we can pick out of this regression. Before I rerun this in R, let me talk a little about what I've done here. X is drawn from the multivariate normal distribution, as you can see right here, and the Sigma matrix, which is the variance-covariance matrix, gives each of the variables a variance of 2 and a covariance of 1.98, which means they're extremely highly correlated with each other. But let's say I keep the variance of 2 and reduce the covariance to 1.5, which will reduce the correlation between the x's quite a bit. If I run this plot3d again, you can see, first of all, that x1 and x2 are now much less correlated than they were. I can even check with cor(x): they are now correlated at 0.71, whereas previously they were correlated much more highly. Actually, I should probably restart the whole thing. Here we go. If I redo this and run the correlation on x, they were previously correlated at 0.989, which is a very high level of multicollinearity. But now, if I do it again with the lower covariance, they're correlated at 0.71. Going back to the 3D plot, you can see that now I can pick out much more of a flat plane in this space. That's good news, because it means the regression is going to have a much easier time independently figuring out the contributions of x1 and x2 to y, to that data generating process. The problem becomes even less significant if we reduce the level of multicollinearity further. So now I'm reducing the covariance to 1, which means the x's are correlated at about 0.52; you can see here's a picture of the correlation between x1 and x2, and it's fairly small. Now we're getting a very good plane floating in this x1, x2, y space, and the regression is going to have very little difficulty picking out the coefficients here. And of course, if I set the multicollinearity to 0, then things are going to be very good, no problem whatsoever. Now, you saw I was running regressions each time here. If I go back and take a closer look at the models (here's the model I ran, and I'll expand it a little so you can see more of it), I was predicting y with x1 and x2, and in the true data generating process both variables have coefficients of 2. What you can see is that when I run this model with very high multicollinearity, so the correlation on x is back up to 0.98, the regression has a hard time figuring out how much of the variance in y is explainable by x1 and x2 separately. In this sample regression, it has assigned virtually all of the causal power to x2: it's saying x2 has a coefficient of about 3.25 and x1 has a coefficient of about 0.75. It's not an accident that those add up to about 4. And it's also saying that x1 is statistically insignificant. So this is exactly the kind of thing we would expect to see in a situation of multicollinearity as high as this.
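For reference, here is a minimal R sketch of the kind of code behind this demo, assuming the MASS package for mvrnorm and the rgl package for the rotating 3D plot; the seed and variable names are illustrative assumptions, not the lecture's actual script.

```r
# Sketch of the multicollinearity demo (assumes MASS and rgl are installed)
library(MASS)   # mvrnorm()
library(rgl)    # plot3d()
set.seed(509)
n <- 50

# Variance 2 on each regressor; covariance 1.98 gives correlation near 0.99
Sigma <- matrix(c(2, 1.98, 1.98, 2), nrow = 2)
X <- mvrnorm(n, mu = c(0, 0), Sigma = Sigma)
cor(X)

# True DGP: both coefficients equal 2
y <- as.numeric(X %*% c(2, 2) + rnorm(n))
plot3d(X[, 1], X[, 2], y)                 # looks like a line, not a plane
summary(lm(y ~ X[, 1] + X[, 2]))          # coefficients split erratically

# Lower the covariance to 1.5 (correlation around 0.7) and the plane reappears
Sigma2 <- matrix(c(2, 1.5, 1.5, 2), nrow = 2)
X2 <- mvrnorm(n, mu = c(0, 0), Sigma = Sigma2)
y2 <- as.numeric(X2 %*% c(2, 2) + rnorm(n))
plot3d(X2[, 1], X2[, 2], y2)
summary(lm(y2 ~ X2[, 1] + X2[, 2]))
```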
The regression is having a hard time deciding how the total coefficient of 4, so to speak, should be partitioned between x1 and x2. Now, in truth, because we generated this data, we know that x1 should have a coefficient of 2 and x2 should have a coefficient of 2. But there's just not enough independent variation between x1 and x2 for the regression to pick that out. As soon as we lower the correlation to 0.71, the regression does much better: you can see it's assigning x1 and x2 coefficients of about 2, about 1.8 in each case. And if we lower the level of multicollinearity even further, it does quite well, getting coefficients of almost exactly 2. So the short answer is that multicollinearity can cause problems, but those problems are not problems of bias; they are problems of variance. On average the regression gets the answer right; it just may have a hard time deciding exactly where to assign the causal power, so to speak, how to partition it. In repeated samples, that means the answers are right on average, but there's a lot of variation around those answers. So when you have two variables that are extremely multicollinear, you're going to have a hard time separating out which one has an effect on y, and you might end up saying that they have some kind of joint effect but you can't decide exactly how that effect should be apportioned. There is a longer answer to this question, which is that we can prove formally that bias is not influenced by multicollinearity, but the efficiency of the standard errors is. So I want to talk briefly about that. You might recall that the formula for the variance-covariance matrix in OLS is sigma hat squared times (X transpose X) inverse, which equals 1 over n minus k times u hat transpose u hat, times (X transpose X) inverse. Now, suppose we have a really simple two-variable model: y equals x1 beta1 plus x2 beta2 plus u. This is your basic two-variable linear model (there's no constant in this model, but that doesn't matter) where x1 and x2 are correlated regressors. And we want to know the precision of our estimate of beta1, that is, how precisely beta1 hat can be estimated as the correlation between x1 and x2 changes. What we're going to do is isolate beta1 hat using a projection matrix. This is the Frisch-Waugh-Lovell theorem trick that I taught you a couple of weeks ago. So M2 y equals M2 x1 beta1 plus M2 u, which equals M2 x1 beta1 plus u. M2 is the projection matrix off of x2; it's the residual-making matrix for the x2 variable. What I'm doing is projecting y off of x2, and x1 off of x2, and the errors off of x2, and that gets me M2 x1 beta1 plus u, because M2 x2 cancels itself out, you might recall: M2 x2 is 0; the residuals are completely orthogonal to the regressors that generated them, which we learned before. And if we use the hats here, the estimates, u hat is by construction orthogonal to x2, and hence to M2, so we have no trouble projecting it off; it just stays u hat. And so the VCV for this new Frisch-Waugh-Lovell regression is going to be sigma hat squared times (x1 transpose M2 x1) inverse. We're just repeating the VCV formula right here, except instead of X transpose X, now we have x1 transpose M2 x1. Strictly speaking, this is x1 transpose M2 transpose M2 x1, all inverted.
But as we learned earlier, M2 transpose M2 just becomes M2, and so we end up with that. And in expectation, that's an unbiased estimate. But what happens here is that as M2 x1 gets smaller, that is, as the residuals from a regression of x1 on x2 get smaller because x1 gets more and more related to x2, this quantity, which is the basic quantity we use to construct standard errors, gets smaller. Now, that's raised to the negative 1 power, so it's an inverse: 1 over that quantity gets bigger, and hence the standard errors get bigger. So the elements of the covariance matrix that comes out of this regression get bigger and bigger as the independent variation left in x1 gets smaller and smaller. This is a little formal argument, using the tools we've already developed, showing that multicollinearity is a problem precisely because it shrinks the variation available to be explained; you can see that the elements that go into the construction of the variance-covariance matrix shrink. As a practical matter, and I don't have a formal proof of this, this only tends to become a problem when you have very, very high multicollinearity, such that this quantity gets shrunk very small. I tend not to worry about multicollinearity at correlations of less than about 0.9. Anything less than 0.9 is probably not something to worry about; between 0.9 and 0.95 is somewhat worrisome; and anything over 0.95 is very worrisome, because a correlation over 0.95 means this quantity gets shrunk to very small levels, which really inflates the standard errors. Now, one question you might be asking is: why isn't there any bias generated by this process? The short answer is that if you go back and look at our proof of the unbiasedness of OLS from a couple of weeks ago, nothing in that proof changes. Nothing in the unbiasedness proof is affected by multicollinearity, and that means there's no bias problem. This is precisely why those proofs are important to look at: they tell you where the results about OLS come from, and they also tell you when we should worry about those results falling apart. In this case, if you go back and look at the unbiasedness proof, there's nothing in there that is affected by multicollinearity. In fact, here's the unbiasedness proof from a previous week, and multicollinearity is nowhere to be found in it; we never invoke any assumptions about the correlations among the x's. So, looking at that proof, no worries. That's a good sign. The upshot is that what we have here is a so-called efficiency problem. Multicollinearity creates an efficiency problem: the standard errors of beta are going to be larger than they need to be, but the average estimate of beta hat in repeated samples is going to be correct. And you can see that by taking a look at an R simulation, which we're going to do right now. All right, so let's take a look at these simulations.
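Here is a minimal sketch of the kind of simulation about to be described; the loop structure, seed, and use of confint() are my assumptions about how this could be coded, not necessarily the lecture's actual script.

```r
# Coverage and false-negative simulation under very high multicollinearity
library(MASS)
set.seed(509)
n.sims <- 1000; n <- 50; beta <- c(2, 2)
Sigma  <- matrix(c(2, 1.98, 1.98, 2), nrow = 2)   # correlation near 0.99

covered <- logical(n.sims); false.neg <- logical(n.sims)
for (s in 1:n.sims) {
  X <- mvrnorm(n, c(0, 0), Sigma)
  d <- data.frame(x1 = X[, 1], x2 = X[, 2])
  d$y <- 2 * d$x1 + 2 * d$x2 + rnorm(n)
  ci <- confint(lm(y ~ x1 + x2, data = d))[c("x1", "x2"), ]
  covered[s]   <- all(ci[, 1] <= beta & beta <= ci[, 2])   # both CIs cover the true 2
  false.neg[s] <- any(ci[, 1] <= 0 & 0 <= ci[, 2])         # a CI covers 0: "insignificant"
}
mean(covered)    # close to 0.95
mean(false.neg)  # roughly a quarter of the time at covariance 1.98; near 0 at covariance 0
```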
Now, what I'm going to do in these simulations is repeatedly, 1,000 times, draw a data set of size 50 with a lot of correlation; the covariance is 1.98, which, as we already showed, corresponds to a really high correlation between x1 and x2. I'm going to generate y as x times beta (the constant in this case is 0), and then I'm going to try to recover those coefficients with a linear model of y on x. Then I'm going to record the accuracy of my beta estimates by looking at whether the 95% confidence interval covers the true coefficients, which in this case are both 2. So I'm going to examine whether both coefficients are covered by the 95% confidence interval; if my estimates are accurate, the 95% confidence interval should cover the true beta 95% of the time. I'm also going to look at the number of false negatives, which is the proportion of the time that I fail to find statistical significance for a coefficient even though I know it should be statistically significant, even though I know the null hypothesis is false. I constructed this data, so I know the null hypothesis is false in this case: both coefficients are 2, so they're definitely not 0. When I run this simulation, a little text progress bar tells me how far I've gotten. There we go. The 95% confidence interval covers the true beta about 95% of the time. That's good news: it means the 95% CI is still giving accurate information. But that CI is going to be wider than it could be, and you can see that in the proportion of false negatives that are generated. About 24% of the time, actually closer to 25%, the 95% CI tells us that the result is statistically insignificant, even though we know for sure that the null hypothesis is false. If I rerun this simulation with a lower level of multicollinearity, what we should expect to see, given the proofs I just showed you, is that the 95% CI will still cover the true beta 95% of the time, but the false negative rate will drop; in other words, we'll get more statistical significance. And you can see that's exactly what happens. All I did in this alternative simulation was drop the level of multicollinearity to 0; there it is, you can see the covariances are 0 in my data set. What happened is that the 95% confidence intervals got narrower, so I was able to reject the null hypothesis more often. In this case, I was always able to reject it. So that bolsters the basic story I just told: multicollinearity creates an efficiency problem. We don't reject the null hypothesis as often as we could, but our beta estimates are, on average, still accurate. In short, multicollinearity is a problem, but it's only a problem insofar as we like rejecting null hypotheses, we like finding relationships; it's going to reduce our certainty in our findings. So what can we do about multicollinearity? Well, this is probably the problem whose solutions are the least methodological. The first thing I would suggest is that if you have two measures of something that are very highly correlated, say at 0.98 or 0.99, ask yourself whether those measures are truly conceptually distinct things. If two things are correlated at 0.99, they're virtually in a straight line with each other.
And that might lead you to ask whether you really have two different measures of two different things, or two different measures of the same thing. If you decide they're not conceptually distinct, you might just drop one measure, because if you have two measures of the same thing, you don't need them both in the regression. Or you might try some method of combining the measures into a single measure: you might create an additive index, or do factor analysis to extract a common factor out of the multiple measures of the same concept. All of those things are reasonable, but they're only reasonable if you have a genuinely good theoretical basis for deciding that these things really are measures of the same concept. And that's not always true. Sometimes measures are highly correlated, but that doesn't necessarily mean they measure the same thing. In that case, your options get a little less appealing, at least in terms of ease of execution. One thing you can do is increase the sample size n. Why does this work? Well, as we've already pointed out, the variance of beta hat equals sigma squared times (X transpose X) inverse. We can rewrite this as (1 over n) sigma squared times (1 over n X transpose X) inverse; all I've done is multiply and divide by 1 over n, a tricky form of 1. And you'll notice that the limit as n goes to infinity of the first piece, 1 over n times sigma squared, is 0. The other piece, (1 over n X transpose X) inverse, goes to some fixed quantity S sub x as n goes to infinity. So this part is going to 0, this part is going to some constant, and the overall value of the variance of beta hat goes to 0 as n goes to infinity; it gets smaller. What this tells you is that if the variance of beta hat is too big, you can collect more and more data and it will shrink smaller and smaller, and in the limit, if you collect an infinitely sized sample, the variance of beta hat goes to 0. So even in cases of multicollinearity, more data is better. Collecting more data will shrink your confidence intervals, will let you extract more information from the data, more confidence from the estimates, and you may be able to reject more null hypotheses when those null hypotheses are actually false. That's not always something you can do; you can't always collect more data. But if you can, it will help. And then the final option is to do nothing, accept less efficient estimates, and make a note in your research that, look, these two things are highly correlated, so it's necessarily going to be hard to separate their influence on y. You just say that that's the case and move on. There are limits to the information we can extract from data sets, and multicollinearity imposes some of those limits. That's not necessarily the end of the world for most projects, unless it's critical to your project to be able to distinguish the impact of two different things that are very highly correlated with each other.
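Going back to the sample-size point for a second, here is a quick sketch of it in R: even holding the correlation near 0.99, the standard error on x1 shrinks as n grows. The names, seed, and chosen sample sizes are illustrative assumptions.

```r
# Standard error of the x1 coefficient at increasing sample sizes, correlation ~0.99
library(MASS)
set.seed(509)
Sigma <- matrix(c(2, 1.98, 1.98, 2), nrow = 2)

se.x1 <- sapply(c(50, 500, 5000), function(n) {
  X <- mvrnorm(n, c(0, 0), Sigma)
  d <- data.frame(x1 = X[, 1], x2 = X[, 2])
  d$y <- 2 * d$x1 + 2 * d$x2 + rnorm(n)
  summary(lm(y ~ x1 + x2, data = d))$coefficients["x1", "Std. Error"]
})
se.x1   # falls roughly like 1 / sqrt(n)
```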
If, for example, your theory predicts that both of those things are positively correlated with y, and you find that they're highly multicollinear and that the combination of the two is positively related to y, then, in my opinion, I would count that as evidence that's relatively confirmatory of your theory. It would be better if we could collect more data and separately identify that each is positively correlated with y; that would be better. But it's not exactly the end of the world if we find that they're correlated with each other and the aggregate is positively correlated with y. The real problem is going to be a case where you're trying to say that one thing is correlated with y and the other is not, or where it's really important that one of those two things is positively correlated with y and the other is just irrelevant or a nuisance. In that case, multicollinearity can cause a problem, because it can interfere with your ability to pin down that one of them, specifically, is correlated with y. And in that case, you're probably going to need either to collect new data or to rethink the conceptualization of your project. All right, let's talk a little bit about omitted variable bias. First, what is omitted variable bias? Well, consider a data generating process like this one right here, where y is a function of two variables and an error term. (And note this should be y, not y hat, because the error term is included.) Now suppose you estimate a model where you only include x1; you omit one of the relevant variables, in this case x2. What's going to happen to your model if you do that? The short answer is that beta1 hat will be biased, and the size of the bias will be proportional to the degree of correlation between x1 and x2 and to the size of beta2. Let's go to an R example to see what's going on here. What I'm going to do is create a fake data set, and x in this case is going to be two variables that are correlated with each other at a fairly high level; the covariance is 1.5. If I generate the data and then calculate the correlation in x, the correlation is 0.84. So these are reasonably correlated variables. Now what I'm going to do is run a model where I only include one of the x variables. The coefficients on both of these variables should be 2; you can see right here in the beta vector that I constructed that the coefficients should both be 2. However, that's not what I get. In the model I run, I recover a coefficient on x1 of 3.48. I've omitted x2, and what's happened is that the explanatory power of x2 has been sucked into x1, because x2 and x1 are correlated and I haven't included x2 in the model. So this is an example where I'm getting an upward bias in my estimate of the effect of x1 because I've omitted x2. Now, one important thing to remember is that this problem only occurs when the omitted x variable is correlated with the included x variable. So I'm going to rerun this simulation, but I'm going to reduce the level of correlation between the two variables by a substantial amount; now they're only correlated at approximately 0.1.
If I rerun this model, omitting x2 again, you can see that my coefficient on x1 is now much closer to the true value of 2 than it was before, even though I'm still omitting a relevant variable, x2. That's because, quite simply, x1 and x2 are not as related to each other, so x1 is not proxying for x2 the way it was in my previous regression. If I build a data set where x1 and x2 are totally uncorrelated, so the covariance between them is zero, I get an extremely accurate estimate of the coefficient on x1, even though I'm not including x2. So the upshot is that omitted variable bias only really matters when the omitted variables are correlated with the variable you care about. This is the point of some of the articles I assigned for this week that talk about simplicity in regression specifications as a valuable thing. A lot of people worry about omitted variable bias because, as I've just shown you, the bias is quite real. And as a result, they throw lots and lots of things they think might be relevant into the regression, just to guard against the possibility of omitted variable bias. The point of those articles (we're not going to go into great depth on this) is that doing so comes at a price. Throwing in lots of extra variables can cause all sorts of mis-specification problems and can also make your regression less efficient. And what I'm trying to communicate here is that doing that is largely unnecessary if the things you're throwing in are not actually correlated with x1. The time it's important to think about omitted variable bias is when there's a variable that you can't collect, or that you've neglected, that you have strong reason to believe is correlated with the thing you really care about. So the decision rule for deciding whether to include a variable in a regression as a control should be: is this control variable likely to be correlated with both y and the independent variable that I really care about? If both of those things are true, then it's probably a good idea to include that control in your regression. If either one of them is not true, then it's not clear that you really need to include that variable as a control, and if you do so anyway, the articles I assigned for this week give some compelling evidence that throwing in these irrelevant variables can make your estimates of the effect you really care about more variable; you can get less accurate estimates. So parsimony is a virtue in regression specification, and the appropriate level of parsimony is dictated in part by whether the control variables you're thinking about including are likely to be strongly correlated with the thing you care about. So that's the short practical answer, with a bit of an R demonstration. There is also a formal proof of the bias problem here, so let's take a brief look at that proof. The derivation is about the difference between the expected value of the estimated beta1 and the true value of beta1, and that's going to be (X1 transpose X1) inverse X1 transpose y, minus beta1. This is the formula for beta1 hat, and the 1 subscript is there because we're talking about the particular X1 variable of interest. Okay, so now I'm going to move this over so I have a little more room.
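A minimal sketch of the omitted-variable demo in R, under the same kind of setup described above (both true coefficients are 2); the helper function, names, and seed are my own illustration rather than the lecture's exact code.

```r
# Omitted variable bias: omit x2 and watch the x1 coefficient
library(MASS)
set.seed(509)
n <- 200

make.data <- function(cov12) {
  Sigma <- matrix(c(2, cov12, cov12, 2), nrow = 2)
  X <- mvrnorm(n, c(0, 0), Sigma)
  data.frame(x1 = X[, 1], x2 = X[, 2],
             y  = 2 * X[, 1] + 2 * X[, 2] + rnorm(n))
}

coef(lm(y ~ x1, data = make.data(1.5)))  # x1 and x2 correlated: x1 coefficient well above 2
coef(lm(y ~ x1, data = make.data(0)))    # x1 and x2 uncorrelated: x1 coefficient close to 2
```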
This equals, and I'm just going to substitute in, right here, the true value for y. We've omitted an X2 variable from this regression, so X2 is going to pop up in that location. So you've got (X1 transpose X1) inverse X1 transpose times (X1 beta1 plus X2 beta2 plus u), minus beta1. Now, multiplying this out, things start to get a little long: the expected value of (X1 transpose X1) inverse X1 transpose X1 beta1, plus (X1 transpose X1) inverse X1 transpose X2 beta2, plus (X1 transpose X1) inverse X1 transpose u, minus beta1. Let me talk about what we've got here. This matrix times its own inverse is just the identity, so the first piece is just beta1. The second piece, (X1 transpose X1) inverse X1 transpose X2, is the formula for a regression coefficient; you can see it's the formula for beta, except instead of y we've got X2 there. So what that's telling you is that we're essentially running a regression of X2 on X1, and the bigger the coefficient from that regression, the worse the omitted variable bias is going to be. I'm going to call this rho; it's not exactly equal to the correlation coefficient, but I'm going to use rho as the marker for it, so we have rho times beta2. Now, by assumption, the expectation of the last piece, the one involving u, is zero, so that dies, and then we've got minus beta1. So there it is: beta1, plus rho beta2, minus beta1. Well, what have we found out? If rho, the quantity linking X1 and X2, is zero, that term dies and our expected difference, our expected bias, is zero. So if that green quantity equals zero, the bias equals zero. The bigger that green quantity, the bigger the bias, the bigger the gap between the expected value of beta1 hat and the true value of beta1. What this formally proves is that you really only need to care about omitted variable bias when you have strong reason to believe that X1 and X2 are powerfully correlated with each other. It's a formal demonstration of the conditions under which omitted variable bias really matters. So, in short, omitted variable bias is a problem. It can be a very significant problem, in the sense that it can make your beta coefficients misleading: going in the wrong direction, too big, or too small. But the conditions under which it poses a problem are limited, and the time you really need to worry about it is when the omitted variable is likely to be correlated with the independent variable of interest. All right, so now for the last problem: heteroscedasticity. Heteroscedasticity just means non-constant error variance, as we discussed before. And this is a problem because the classical linear regression model, and the classical linear normal regression model too, assumes, first, that all error terms i have a constant variance sigma squared; second, that the expected correlation between any two error terms i and j is equal to zero; and third, that the variance-covariance matrix of u, the expectation of u u transpose, is equal to sigma squared times I, the n by n identity. In other words, it's a diagonal matrix of dimension n with sigma squared down the diagonal. But suppose that's wrong. Suppose instead that each one of the error terms has a different variance.
So the expected value of u u transpose is Omega, and Omega is still a diagonal matrix, but instead of each element of that diagonal being the same, each one is different. So now, for example, u1 has a variance of omega 1 squared, u2 has a variance of omega 2 squared, u3 has a variance of omega 3 squared, and none of these are equal to each other. That's heteroscedasticity. What are the consequences? Well, the consequences are somewhat severe, in the sense that the estimated variances of beta hat that come out of the ordinary least squares framework can be either too big or too small, and we don't necessarily know which unless we know the underlying DGP. So let's take a look at a little R example that demonstrates some of the problems here. Here's R, and what I'm going to do is create a fake data set of size 200, and the error terms are going to be normally distributed, as in the CLNRM; however, their standard deviations are going to vary with x. So the standard deviation of the u variable is correlated with x. There's my fake data set, and I can add a little bit of code to show you what's going on. If I plot u against x, you'll see that what I've done is made the variance of the error term positively associated with x: bigger x means greater variance. Now, if I run a standard model and look at the statistical significance of the coefficients, you'll see, first of all, that the true beta on x is 2 and we recover that true beta very well with this model. The standard error is 0.03, and the coefficient is statistically significant. So everything kind of looks right. But if you plot the residuals from the model against x, right here, you get essentially the same plot I generated before: you can see that the spread of the residuals is positively related to x. This is a standard diagnostic for heteroscedasticity. If you plot the residuals against a variable and you see some pattern in them, for example a fanning-out pattern like this one, you know that heteroscedasticity is a possibility in your data set. And the upshot is that your 95% confidence intervals will no longer have the correct coverage. So what I'm going to do is repeat what I just did a thousand times in a simulation. While that simulation is running, I'll talk about what I'm doing: I'm just doing the same thing over and over again and then looking at whether the confidence intervals cover the true beta value 95% of the time. And in short, they don't. Take a look at this result: the 95% confidence interval covers the true beta only 92% of the time, as opposed to 95%. That means the confidence intervals in this case are actually too narrow, too small, which would be a problem if the null hypothesis were true, because we'd be rejecting the null hypothesis more than 5% of the time. So that is one important consequence of heteroscedasticity: we might get results that are overconfident. We can also get results that are underconfident, where we reject the null hypothesis too rarely. I could construct a simulation to show you that, but I think it's a little more worrisome for most researchers to reject the null hypothesis too much.
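A minimal sketch of that heteroscedasticity demo and coverage simulation in R; the particular error-variance function, seed, and names are assumptions chosen for illustration.

```r
# Error standard deviation grows with x; check 95% CI coverage for the slope
set.seed(509)
n <- 200; n.sims <- 1000; beta <- 2

covered <- logical(n.sims)
for (s in 1:n.sims) {
  x <- runif(n, 0, 10)
  u <- rnorm(n, sd = 0.5 + 0.5 * x)      # error variance increases with x
  y <- beta * x + u
  ci <- confint(lm(y ~ x))["x", ]
  covered[s] <- ci[1] <= beta & beta <= ci[2]
}
mean(covered)   # typically noticeably below 0.95 with this fan-shaped error variance

# The standard diagnostic: plot residuals against x and look for a fan pattern
m <- lm(y ~ x)
plot(x, resid(m))
```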
Either way, we know the standard errors are going to be problematic. So if you run a regression, look at the relationship between the residuals and your independent variables, see that there's a pattern there, and ignore it, you're running the risk that your hypothesis tests are either overconfident or underconfident. You could be rejecting the null hypothesis too often or not often enough, and you won't necessarily know a priori which is true. And given that the usual case is that you're rejecting the null hypothesis too much, your results may be suspect in the sense that you may be mistaking noise for genuine discoveries. Okay, so how do we know that heteroscedasticity is a problem? I showed you an example in R of it causing a problem, but we can also show formally that there's a problem here. I'm going to go briefly over some results that demonstrate this. The variance of a particular coefficient beta hat i is 1 over n times x i transpose Omega x i, or 1 over n times the sum from j equals 1 to n of omega j squared times x ji squared, where i indexes a column of X and j indexes a row. All I've done is extract one element of the VCV matrix here and show you the computation of the variance for that particular element. Okay, now what I want to do is suppose that the average of the omega terms has a particular limit. What I want to say is that the limit as n goes to infinity of 1 over n times omega transpose omega is sigma squared. What that says is that the variance of a particular observation, averaged across all the observations, converges to a fixed value sigma squared. Each observation has a different variance, but if you average all those variances, you get a single average variance, sigma squared. Okay, that came out a little confusingly. If we suppose this, which is pretty trivial to suppose, then we can put a limit on the complicated expression up here. In particular, we can say the following: the limit of the expression above, as n goes to infinity, is the following expression. Now, I'm not going to prove that; you can see it in Davidson and MacKinnon if you'd like a proof. But I am going to use it to illustrate what's going on here. The extent to which this expression differs from the variance-covariance matrix assuming homoscedasticity is a function of the relationship between the amount of heteroscedasticity and x. What this suggests is that the form of the heteroscedasticity has to be related to x to be consequential. If it's not, there isn't going to be a problem. The reason is that if there's no relationship between the heteroscedasticity and x, then this highlighted quantity right here goes to zero, and the limiting behavior of the omega terms tends toward sigma squared. So what this suggests is plotting the residuals against the columns of X to determine the presence of heteroscedasticity. Let me boil this down to a more applied framework. What this is telling you is: look, it doesn't really matter if each observation has a different error variance, as long as those error variances are not related to one of the predictor variables.
Because if they're not related to the predictor variables, it's as if all the error variances just shared a common sigma squared. Heteroscedasticity is a problem precisely when the error variance is correlated with x. That's exactly why, back in the R example, we constructed a plot of residuals against x: we're looking for that pattern. If you find that pattern, then you have a problem you need to fix. If you don't find that pattern, then you don't really have a problem to fix; your standard errors are going to be fine. So in diagnosing heteroscedasticity, it's important to construct these plots in order to figure out what's going on. Okay, so now the question is: what can we do about heteroscedasticity if we think it's there? Fortunately, there are several things we can do. As promised, this is the most methodologically oriented of the corrections; the patches, so to speak, for heteroscedasticity are numerous, and there's been a lot of work done on them. So let's spend a little time talking about that. The first correction I want to talk about is the Huber-White heteroscedasticity-robust standard error. Heteroscedasticity-robust standard errors in general are transformations of the VCV matrix that an analyst applies in order to correct for potential heteroscedasticity problems. The Huber-White correction is the following. The standard VCV for beta is given by sigma squared times (X transpose X) inverse. But Huber and White say, well, the assumptions that go into constructing that VCV are no longer correct. So instead, we're going to construct the following: the VCV matrix of beta hat is (X transpose X) inverse, times X transpose Omega hat X, times (X transpose X) inverse. What we're going to do is estimate Omega hat, which is the form of the heteroscedasticity, and they estimate Omega hat by taking the diagonal of u hat u hat transpose. In other words, construct u hat u hat transpose as the VCV of the errors, keep only the diagonal elements, and set all the off-diagonal elements to zero; assume the errors are still uncorrelated with each other. Then substitute that diagonal matrix in for Omega hat. That's their estimate of Omega hat. If you apply this VCV, then, they claim, everything is fine and your estimates are in general accurate. A few notes about this. This procedure is only asymptotically valid, which is a way of saying that the Huber-White correction is only accurate, technically speaking, in infinitely sized samples, or in very large samples, and the samples must be very large for the Huber-White correction to be good. The reason is that this quantity here is only asymptotically unbiased; in other words, it's consistent, not unbiased. There are all sorts of reasons for that that we don't need to belabor. But the important thing to remember as an applied statistician is that Huber-White standard errors are not suitable for small samples. Now, an immediate question is: what counts as a small sample? How small is too small? There's no hard-and-fast rule, but I think it's safe to say that an n of at least 100 is probably a good rule of thumb, and bigger is better. There are lots of implementations of this estimator; we're going to talk about the R implementations in a second.
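Before that, to make the sandwich concrete, here is a small by-hand sketch of the HC0 calculation, assuming an lm() fit called m; in practice you would use a canned implementation like car's hccm, which comes up in a moment.

```r
# HC0 "sandwich" computed by hand: (X'X)^-1 X' Omega.hat X (X'X)^-1
hc0 <- function(m) {
  X     <- model.matrix(m)                # includes the intercept column
  u.hat <- resid(m)
  bread <- solve(t(X) %*% X)
  meat  <- t(X) %*% diag(u.hat^2) %*% X   # Omega.hat = diag(u.hat^2)
  bread %*% meat %*% bread
}
# sqrt(diag(hc0(m))) gives the HC0 robust standard errors
```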
If you happen to use Stata, Stata's command for regression, as you may recall, is regress y x. If you add the robust option, it constructs robust standard errors; it actually constructs something very close to the Huber-White standard errors. The thing I've just shown you is called the HC0 variant. Stata actually implements the HC1 variant: all you do is take HC0, which is what I just showed you, and multiply it by n over n minus k, where k is the rank of X. That's just a small-sample correction that makes the performance of the estimator slightly better. So when you type regress y x, robust, Stata is doing something very close to Huber-White standard errors, just inflating them by a small amount to correct for sample size. Incidentally, Stata's correction does not make it okay to use Huber-White standard errors in small samples; the small-sample problem I already noted still applies. Okay, so there are other heteroscedasticity-robust standard errors, and one I want to talk about is HC3, which is Efron's (1982) variant of the Huber-White standard errors. Again, I don't want to belabor this too much, but I'll briefly show you the formula for the Efron standard errors: the diagonal elements of Omega hat are the squared residuals divided by (1 minus h jj) squared, where h jj equals x j times (X transpose X) inverse times x j transpose, and x j is the jth row of X. This looks like the Huber-White standard error, except that the thing in the middle, the Omega hat, is different: it's the same squared residuals, but weighted by this h jj term. That residual weighting corrects for the fact that the variance of OLS residuals for more influential observations is greater than the variance of OLS residuals for less influential observations; h jj comes from the hat matrix, which tells you the extent to which a particular observation is influential. Efron's standard error, it turns out, has better performance in small samples than HC0 or HC1, the Huber-White statistics I showed you before. And here's a graph from a published article; this comes out of Long and Ervin (2000). Long and Ervin did a Monte Carlo study of all of these different statistics in order to determine their performance in different situations. This panel shows performance when the underlying data generating process is heteroscedastic, and there are four coefficients being estimated in the model: a constant and three variables. The null hypothesis is true in all of the cases, so we're looking for 5% rejection rates; in other words, we want to reject the null hypothesis 5% of the time when the null hypothesis is true. As we already discussed, heteroscedasticity makes it possible that we will reject too much, that we will reject the null hypothesis too often when it's true. Each of these lines is labeled with the HC variant: HC0 is the Huber-White standard error, HC1 is the Stata correction, HC3 is the Efron variant I just showed you. The square box, if I recall correctly, is the raw performance with no robustness correction at all. What you'll notice is that in very large samples, all of them do okay. For the heteroscedastic variable, which is this one right here, you can see that the raw, unadjusted standard errors reject the null hypothesis way too frequently, about 12% of the time, even in large samples.
However, the robust standard error HC0 can actually make the problem worse, even for the heteroscedastic variable, in small samples. At sample sizes of 25 and 50, HC0 is actually rejecting the null hypothesis even more frequently than the raw, unadjusted standard error for the heteroscedastic variable. That's bad, very bad. And for the non-heteroscedastic variables, it's also increasing the rejection rates. So, kids, don't use HC0 in small samples, where small is anything less than, oh, let's say about 100; you can see that's roughly the point where HC0 crosses over to being better than no correction at all for the heteroscedastic variable. HC3, the Efron standard error, actually does extremely well; it's quite conservative and hovers close to 0.05 in all situations. So it would probably be advisable to use the Efron standard error correction under all circumstances, and since it's, as I'm about to show you, as easy to use as any other correction, I think it's a good idea to use it if you can. I also want to show you briefly what happens under homoscedastic conditions. Suppose there's no heteroscedasticity at all. As you might expect, here's the box that marks the uncorrected standard error results: the uncorrected standard errors do fine. The HC0-corrected standard errors reject the null hypothesis too often. And the HC3 standard errors do well; everything's fine. The graph on the right shows power, which is the probability of rejecting the null hypothesis when the null hypothesis is false. What you can see is that HC0 does have slightly greater power, just because it's more likely to reject anything, including true null hypotheses. But the power of HC3 is actually quite close to that of HC0 across the entire range of sample sizes. So you do lose a little bit of power, that is, a little bit of ability to correctly detect effects where they exist, but not a whole lot. That's good news for using HC3; in other words, it says you probably ought to use HC3, because the trade-off for using it is quite small. Let me show you how to estimate these various standard errors in R. This requires the car library, so if you haven't already installed the car package, you should do that; just use the regular install.packages dialog and get car with all its dependencies. What I'm going to do is construct a model data set with heteroscedasticity. So, there's the data set; I've got a model here; there's the heteroscedasticity in the errors. Let's suppose I want to construct White heteroscedasticity-robust standard errors. The command to construct heteroscedasticity-consistent covariance matrices, or HCCM, is this hccm function right here. If I run hccm on the model and specify the type of standard errors I want, I get a VCV. So I want to take the diagonal elements of that VCV and then their square roots, because all I care about in this case are the standard errors for my coefficients. In this case I've got two coefficients, the intercept and x. So if I run this, here are my White standard errors. Now, you can see the raw, unadjusted standard errors are 0.174 and 0.029, and the White standard errors are 0.176 and 0.032. In other words, the intercept standard error is a little bigger and the x standard error is a little bigger. That's good news.
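A minimal sketch of that car-based workflow; the data-generating step and names are assumptions, but hccm() with type "hc0" or "hc3" is the call being described.

```r
# White (HC0) and Efron (HC3) robust standard errors via the car package
library(car)
set.seed(509)
n <- 200
x <- runif(n, 0, 10)
u <- rnorm(n, sd = 0.5 + 0.5 * x)    # heteroscedasticity related to x
y <- 1 + 2 * x + u
m <- lm(y ~ x)

sqrt(diag(vcov(m)))                  # raw, unadjusted standard errors
sqrt(diag(hccm(m, type = "hc0")))    # Huber-White (HC0)
sqrt(diag(hccm(m, type = "hc3")))    # Efron (HC3)

# Robust t-statistics: coefficient divided by the robust standard error
coef(m) / sqrt(diag(hccm(m, type = "hc3")))
```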
You should definitely see bigger standard errors with White's heteroscedasticity-robust standard errors. And as a note, if robust corrections ever make your standard errors smaller, that is a prima facie indication of an overly small sample size and downward bias. In other words, if you ever see a case where robust standard error corrections make your standard errors smaller, make it easier to reject the null hypothesis, you should not use them; you should discard them. That's a bad sign in and of itself: it indicates that the asymptotics are not kicking in, that your sample size is not big enough. So if I want to conduct t-tests with these robust standard errors, I just use the usual t-test formula, beta divided by the standard error. Here are my t-values; you can see my t-statistics are lower for both of my variables, which is as it should be. If I want to use the Efron variant, HC3, all I do is change the type to hc3. If I repeat all these commands with HC3 instead, you can see that my t-statistics drop even further, which again is correct: the Efron standard errors are designed to work better in small samples and are consequently more conservative. Now suppose I do this process over again, but now I construct a data set where there is heteroscedasticity, but the heteroscedasticity is not correlated with the predictor x. Here's my data set, and you can see I've constructed an error variable that does have heteroscedasticity in it, but the standard deviation of the error is just a random quantity that falls anywhere between one half and two and is not correlated at all with my predictor x. So what happens if I construct robust standard errors in this case? I run the model and then construct the t-tests. My t-value for the raw, uncorrected coefficients is 31 for the intercept and 118 for x. With the HC0 robust standard errors, the t-statistic for x is actually bigger; in other words, it's inflated. That's not good: it means we're going to reject the null hypothesis more often than we should. This is an example of White's HC0 heteroscedasticity-robust standard error failing in a small sample, and in this case the sample is as big as 200. The Efron standard errors give me a t-statistic of 119, which is very close to the 118 we get from the uncorrected standard errors. So what this tells us is that the robustness corrections don't make a big difference in this case, and that HC0, in this particular example, is making the t-statistics too big, the confidence intervals too narrow. So, in short, if you suspect heteroscedasticity, the Efron standard errors are the better choice, but the correction is only going to matter in cases where you have reason to believe that the heteroscedasticity is correlated with x. Now comes the truly fun part of the lecture: cluster-robust standard errors. Clustered standard errors are extremely popular in a lot of different disciplines, economics in particular, but also political science. But I think it's fair to say that they're not very well understood, and there aren't a lot of good sources that explain exactly what they're doing. The biggest danger is that they're much easier to create than they are to understand. You may be aware that Stata can implement cluster-robust standard errors if you just type regress y x, cluster(clustervar).
Now comes the truly fun part of the lecture: cluster-robust standard errors. Clustered standard errors are extremely popular in a lot of different disciplines, economics in particular, but also political science. But I think it's fair to say that they're not very well understood, and there aren't a lot of good sources that explain exactly what they're doing. And the biggest danger is that they're much easier to create than they are to understand. So you may be aware that Stata can implement cluster-robust standard errors just by typing something like regress y x, cluster(clustvar). The cluster() option takes the place of the robust option, and clustvar is a clustering variable, a grouping variable that assigns each data point to a group. One example: suppose you've got some kind of international relations data, some dependent variable and some independent variable, but you've got repeated observations inside each country. You might cluster the standard errors on country, because all the observations within a country are similar to each other. This is something people do all the time in all sorts of different data sets, and the point of clustering is to correct for the possibility that errors are correlated inside of groups, but not between groups. So here's what's going on. White's heteroscedasticity-robust standard error and the Efron variant both set omega hat equal to some version of the diagonal of û û transpose, that is, a diagonal matrix with the squared residuals on the diagonal. That allows each error term to have a different variance, but there's no correlation between error terms. That right there is the Huber-White error matrix. Clustered standard errors, instead of putting individual observations on the diagonal, put entire group-wise blocks on the diagonal. So what you do is block out the observations by group and construct a matrix where an observation's error can be correlated with the errors of the other observations in its group. Let's talk a little bit about what exactly is going on here. Huber-White, HC0 in other words, calculates X transpose omega-hat X, as we've already discussed, where X is the N by K matrix of predictors and û is the N by 1 vector of estimated residuals. What that looks like is X transpose, times a diagonal matrix with û1², û2², out to ûN² on the diagonal, times X. We can rewrite this. Suppose there are two predictors and N observations. Then X transpose has a row of x's, x1, x2, out to xN, and a row of z's, z1, z2, out to zN, where x and z are the two independent variables. The middle matrix has the squared residuals û1² through ûN² down its diagonal and zeroes everywhere off the diagonal; that's an important property of the Huber-White matrix, because we're assuming all of the off-diagonal error correlations are equal to zero. And the last matrix is X itself, with a column of x's and a column of z's. Multiply the first two together, row by column: because all the off-diagonal entries are zero, the first row gives you x1û1², x2û2², x3û3², and so on out to xNûN², and the second row gives you the same thing with z's, z1û1² out to zNûN². That whole thing then gets multiplied by X, the matrix with x1, x2, out to xN in one column and z1, z2, out to zN in the other. Okay, so now, multiplying these two things together, I'm gonna need a little more space here, and you finally get a really big matrix of this.
The upper-left element is x1²û1² + x2²û2² + blah, blah, blah + xN²ûN², with the hats on the u's. So now I'm gonna draw a line here so we can see that there are gonna be four elements of this matrix. Why four? Well, the first matrix is two by N, so K is two in this case, the middle one is N by N, and the last one is N by two, so the resulting matrix is gonna be two by two. The upper-right element is gonna be x1z1û1² + x2z2û2² + blah, blah, blah + xNzNûN². So what we've got here is a sum of squared residuals multiplied by the values of the predictors, which gets at the idea that the relationship between the errors and X is the key factor in the heteroscedasticity. Of the lower two elements, the lower-left one is exactly the same as the upper-right one; you can multiply it out to verify that. And the lower-right element is gonna be z1²û1² + z2²û2² + blah, blah, blah + zN²ûN². And I hope you'll see that we can rewrite this whole two by two matrix as a product: take the two by N matrix whose first row is x1û1, x2û2, out to xNûN and whose second row is z1û1, z2û2, out to zNûN, and multiply it by its own transpose. Okay, so that's what Huber-White, or HC0, standard errors are doing. What clustered standard errors do is substitute something in for those columns: instead of each column being a single observation's xiûi and ziûi, each column is a group-wise sum. In other words, if I was gonna write this out for clustered standard errors, and I'm writing it in red so you know it's different, the first column would be the sum of xiûi over i from 1 to n1 and the sum of ziûi over i from 1 to n1, where n1 is the number of observations in group one; the next column would be the same two sums taken over the observations in group two; and so on, out to the sums for group G. So what's different? We're saying there are G many groups, each group j has n-sub-j observations in it, and we sum xiûi and ziûi for each group separately. We construct group-wise sums. That gives us a matrix with the groups in the columns, G many of them, and the variables in the rows, K many of them, where K equals two in this particular example. Call that matrix E. Then, going back up here, we saw that the Huber-White middle piece was that two by N matrix of xiûi and ziûi columns multiplied by its own transpose. So in the clustered standard errors, the piece that plays the role of X transpose omega-hat X is gonna be E times E transpose instead.
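To collect the board work in one place, here is the same construction in compact notation; nothing new is added here, it is just a restatement of what was derived above, with x_i the K by 1 predictor vector for observation i, û_i its estimated residual, and groups g_1 through g_G.

```latex
% HC0 / Huber-White "meat": one squared residual per observation
X'\widehat{\Omega}X
  = X'\,\mathrm{diag}(\hat u_1^2,\dots,\hat u_N^2)\,X
  = \sum_{i=1}^{N} \hat u_i^2\, x_i x_i'
  = A A',
\quad\text{where } A = \bigl[\hat u_1 x_1 \;\; \hat u_2 x_2 \;\cdots\; \hat u_N x_N\bigr] \text{ is } K \times N.

% Cluster-robust "meat": sum within each of the G groups first, then take outer products
\sum_{g=1}^{G} \Bigl(\sum_{i \in g} \hat u_i x_i\Bigr)\Bigl(\sum_{i \in g} \hat u_i x_i\Bigr)'
  = E E',
\quad\text{where } E = \Bigl[\textstyle\sum_{i \in g_1}\hat u_i x_i \;\cdots\; \sum_{i \in g_G}\hat u_i x_i\Bigr] \text{ is } K \times G.
```

In both cases the full covariance estimate sandwiches this middle piece between (X'X) inverse on either side, ignoring the finite-sample degrees-of-freedom adjustments most software applies.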
So: take E and multiply it by its transpose; that's our new cluster standard error matrix. I don't wanna belabor this too much, because you can really go down a rabbit hole trying to figure out why all of these things hold. But suffice it to say that this construction of the robust covariance matrix, the one I've just highlighted, allows the error terms inside of a group to be correlated with each other. So for example, if we're modeling GDP, all the German observations might have errors that are biased upward, because there's something about Germany that tends to increase its GDP net of the model; Germans are really productive, who knows. And in another country, say France or Belgium, all the errors might be biased downward, because net of the model that country just tends to have lower GDP for some reason. That's the structure clustered standard errors are designed to allow, and this is the construction that allows it to happen. One thing you might notice is that each element of E is now estimated by a group-wise sum. The consequence is that, because you're estimating the middle of the sandwich with G group sums, it's really G that matters for the asymptotics, not N. If, for example, you had a million observations but only two groups, with half a million observations in group one and half a million in group two, and you ran a model and applied clustered standard errors because you believed the errors inside those groups were correlated with each other, then it's very likely the standard errors that came out of that estimation would be too small, which is to say your confidence intervals would be too narrow, which is to say you would reject the null hypothesis more often than you should. So the upshot is that you need to think about the number of groups whenever you think about applying clustered standard errors to correct for this kind of error structure. Angrist and Pischke suggest in their book Mostly Harmless Econometrics that a good number of groups is, roughly speaking, 40. If you have fewer than 40 groups, it's probably a bad idea to implement clustered standard errors; the result will probably be that your standard errors are too small and you reject the null hypothesis too often. If you have more than 40 groups, it may well be a good idea to implement cluster-robust standard errors, because they will stop the correlation between the errors inside of groups from biasing your test statistics. So I have a little simulation in today's lecture that I wanna show you, and I've already run it so we don't have to sit here waiting for it. The simulation accomplishes two goals. First, it shows you how to estimate cluster-robust standard errors in R. I have a function called clx, which I copied from Mahmood Arai, who wrote it in 2008, that constructs the cluster-robust covariance matrix for you. Then what I do is create a Monte Carlo data set with four clusters and 11 observations per cluster. So that's a total of 44 observations in the data set, pretty small.
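Before the results, here is a minimal sketch of how you might set up that kind of data in R. I'm using vcovCL() from the sandwich package rather than the clx function mentioned in the lecture, and the data-generating details are my own illustration of the structure described: four clusters, 11 observations each, a group-level shift in the errors, and a true slope of zero.

```r
# Hypothetical re-creation of the Monte Carlo setup described above,
# using sandwich::vcovCL() instead of the lecture's clx() function.
library(sandwich)
library(lmtest)

set.seed(123)
G <- 4                       # number of clusters
m <- 11                      # observations per cluster
g <- rep(1:G, each = m)      # cluster id for each observation

# A group-level shift in the errors induces within-cluster correlation.
group_effect <- rnorm(G, sd = 1)[g]
x <- rnorm(G * m)
y <- 0 * x + group_effect + rnorm(G * m)   # true slope on x is zero

model <- lm(y ~ x)

# Conventional ("vanilla") standard errors.
coeftest(model)

# Cluster-robust standard errors, clustering on g.
coeftest(model, vcov = vcovCL(model, cluster = g))
```

With only four clusters, the cluster-robust intervals can easily come out too narrow, which is exactly the point the simulation results below are making.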
Then I'm gonna construct confidence intervals using the cluster-robust standard errors and the standard, vanilla standard errors. My model here is y equals zero times x, so x has no relationship with y, plus a normally distributed error term, plus gg, where gg is a group-specific constant added to or subtracted from every observation in that group. In other words, the errors within a group will be correlated if you don't include a group-level predictor, that is, a dummy variable for group. Then I run a standard model predicting y with x, completely neglecting the group structure, produce a summary, and construct confidence intervals for each version, and I look at the proportion of the time the confidence intervals cover the true beta, which in this case is zero. What you can see down here in the results is that my clustered standard errors failed to reject the null, which is to say got the right answer, 83.7% of the time, whereas the vanilla standard errors did so 93.7% of the time. Both of them are rejecting the null more than 5% of the time, which is bad, but the clustered standard errors are actually doing it more often. The cure is worse than the disease in this case. So the upshot is that whenever you even wanna think about using clustered standard errors, you wanna be very careful about the number of groups you have, and make sure that the number of groups is large enough to sustain the asymptotic consistency that's assumed in the construction of that particular estimator. All right, I wanna conclude with one last word for the day, which is a warning. The warning is that heteroscedasticity is not always a problem you fix by slapping on White's heteroscedasticity-robust standard errors or Efron standard errors or whatever. It's not always a problem that you patch by typing robust after your regression. In particular, heteroscedasticity can be caused by problems that are best addressed in the model specification, rather than in a post-estimation patch to the standard errors. Let me give you an example of that. Suppose, as we did in a previous week, that there was a U-shaped relationship between X and Y, but you estimated a linear relationship. I'm just gonna sketch out really quickly what that might look like, putting in a little axis here. Suppose we have some observations that tend to be U-shaped, or quadratic, and instead of a U you throw in a line, so you end up with a regression line that looks a little something like this; let me put some more observations around the line so it looks more sensible. Now, if you plotted the estimated residuals from this regression, you'd see a run of positive errors, then a run of negative errors, then a run of positive errors again. You could call that heteroscedasticity, because the error variance is obviously non-constant and it's related to X; this axis is Y, this one is X, and those errors are obviously gonna be related to X. But that doesn't mean this problem is best fixed by typing robust in Stata after your regression, or by applying any other patch that's designed to fix heteroscedasticity. That's because, in this case, the heteroscedasticity is present as a result of a model mis-specification error, namely the omission of a quadratic term for X, and the best way to address it is just to add an X-squared term to the regression.
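Here is a quick, self-contained sketch of that situation; the data-generating process is my own illustration of the point, not anything from the lecture's files.

```r
# Illustration: a U-shaped (quadratic) relationship fit with a straight line.
set.seed(1)
n <- 300
x <- runif(n, -3, 3)
y <- 1 + x^2 + rnorm(n)           # the true relationship is quadratic

misspecified <- lm(y ~ x)          # straight-line fit
plot(x, resid(misspecified))       # residuals swing positive, negative, positive

# The better fix is to change the specification, not the standard errors:
correct <- lm(y ~ x + I(x^2))
plot(x, resid(correct))            # the systematic pattern disappears
```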
There are lots and lots of other examples where this can happen. So the thing I wanna make clear is that not all heteroscedasticity is a heteroscedasticity problem per se; sometimes heteroscedasticity problems are specification problems. This warning is gonna become even more relevant in future weeks of the course. We're gonna talk, for example, about binary dependent variable models, where Y can only take on the values zero and one. If you try to model that kind of dependent variable with a straight line, as in the linear probability model, you're gonna see that heteroscedasticity is an inevitable consequence. In that case, you can patch it with some kind of robust standard error, but it would probably be better to implement a curvilinear function that takes account of the boundaries of Y, and that's what we're gonna do. There are other examples of this as well in panel and time-series data sets, where certain kinds of heteroscedasticity arise as a result of correlation of the errors over time. You can apply patches to that kind of thing, but a better way to address it is to try to model the time dynamics. So for right now, I just want you to keep in mind that heteroscedasticity can be a problem in and of itself, but it can also be a symptom of other problems. I've shown you one example right here where it's a specification problem, and there are other examples like it. So whenever you encounter heteroscedasticity in your regression diagnostics, don't necessarily jump immediately to some kind of robust correction; think about whether there's a specification problem in your model first, before you apply that patch. All right, thanks for watching and I will see you next week.