Heteroskedasticity-consistent standard errors, or robust standard errors, are quite common in empirical work. Their extension is cluster-robust standard errors, which take care of non-independence of the observations. These techniques are fairly simple to use, because they have been programmed into most commonly available statistical software, and if your software supports these standard errors, using them is simply a matter of switching them on. You don't really need to understand the math behind how the standard errors are calculated. But in this particular case, going through the math behind these two types of standard errors allows us to learn what these standard errors do, what they are capable of, and what their limitations are; and in the case of cluster-robust standard errors, looking at the equation teaches us something new about when clustering of observations is a problem and when it is not. In this video, I will walk you through how the heteroskedasticity-robust standard errors are derived, how the cluster-robust standard errors are derived, where we actually make the homoskedasticity and independence-of-observations assumptions in regression analysis, and what those assumptions mean for the calculations.

Let's start with heteroskedasticity. The idea of heteroskedasticity was that if we have a predictor here on the x-axis, the population regression line, and the error term, which is the variation of the observations around the regression line, then the variance of the error term is not constant. In some parts of the regression line there is less variation than in other parts, so the variance varies as x varies. The homoskedasticity assumption was that this variance around the regression line is constant, so that the observations don't spread out or become closer to the regression line as x changes. Of course, you can have many other shapes of heteroskedasticity beyond this simple funnel shape. This was the fifth assumption in regression analysis, and it was required for consistent estimation of the standard errors. If there is a lack of homoskedasticity, that is, heteroskedasticity, in your data, then the conventional standard error equation will produce incorrect results.

So let's take a look at why heteroskedasticity causes problems for the conventional standard errors. The conventional standard errors are calculated using this equation for the variance of the estimates. We have the sigma here, which is simply the variance of the error term. We replace the variance of the error term with the variance of the residuals, which is its estimate, and then we divide by the sum of squares total of x, which is just the sum of squared differences of x from its mean: how much x varies around its mean, multiplied by roughly the sample size. That gives us an estimate of the standard error of the regression coefficient in the simple regression case. So let's take a look first at where this simple equation comes from and why we need the homoskedasticity assumption.
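To keep the narration easy to follow, here is the conventional variance formula in the form just described, reconstructed from the narration (the slide itself is not shown in the transcript):

$$
\widehat{\operatorname{Var}}(\hat{\beta}_1) = \frac{\hat{\sigma}^2_u}{SST_x},
\qquad
SST_x = \sum_{i=1}^{n} (x_i - \bar{x})^2,
$$

where $\hat{\sigma}^2_u$ is the variance of the residuals, used as the estimate of the variance of the error term.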
If the homoskedasticity assumption fails, then we have this alternative formula, which can be found in many econometrics textbooks: you take the squared residuals, multiply them by the squared differences of the observations from their mean, take the sum, and divide by the sum of squares total of x raised to the second power. That gives you the heteroskedasticity-robust standard errors. So let's take a look at why this formula works under heteroskedasticity and the conventional one doesn't.

We need to start by looking at the variance of the regression coefficient, and we derive that variance formula on this slide. I will take a couple of shortcuts to make it fit on one slide; you can find the full derivation in your favorite econometrics book. We will derive one particularly useful form of this equation. Let's start with the covariance of x and y divided by the variance of x, which is the simple regression coefficient estimated by OLS. We can write out the covariance and variance equations. The covariance is simply the average product of two differences from the mean: an observation minus its mean, multiplied by the corresponding observation of the other variable minus its mean. We work with differences from the mean, multiply two differences together, take the sum, and divide by n minus one, which gives an unbiased estimator of the covariance. Then we have the variance, which is simply the covariance of the observation with itself. This can be simplified by eliminating the n minus one, because it appears in both the numerator and the denominator, and we can further simplify by writing the denominator as a square. So we have the sum of squared differences from the mean. This is basically the sum of squared residuals from a regression equation that has only an intercept: it is the sum of squares of the null model if we regress x on just an intercept. We write it as the sum of squares total of x: we take differences from the mean, square those differences, and take the sum, and we do that for the variable x.

So let's move on. We can take the upper part of this equation and separate it: we can write it out as (x minus x-bar) times y_i, minus (x minus x-bar) times y-bar, where y-bar is simply the mean of y. It turns out that the second part is actually zero, so we can drop it, and we are left with x minus its mean, multiplied by y, divided by the sum of squares total of x. Y, of course, is our dependent variable, and it can be written out using the population regression model: y is beta-zero plus beta-one times x plus the error term, as the model defines. We can further simplify this equation by splitting it into parts. The beta-zero is a constant, and its term is eliminated because the differences of x from its mean sum to zero. Then we have beta-one times the sum of (x minus x-bar) times x_i, and it turns out that that sum is the same as the sum of squares total of x, so dividing by the sum of squares total leaves just beta-one. This gives us a convenient formula: the estimate of beta-one is beta-one plus this remaining term. So to understand how much beta-hat, the estimate of beta, varies, we need to understand how much this remaining term varies, because the regression coefficient beta in the population is a fixed value; it doesn't vary. The only thing that varies here is this term, and how much it varies is what our standard error quantifies. So now we can start deriving the standard error.
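For reference, here are the two equations this passage refers to, reconstructed in standard textbook notation (the robust formula is the basic HC0 form):

$$
\widehat{\operatorname{Var}}_{\text{robust}}(\hat{\beta}_1) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2 \hat{u}_i^2}{SST_x^2},
\qquad
\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})\, y_i}{SST_x} = \beta_1 + \frac{\sum_i (x_i - \bar{x})\, u_i}{SST_x}.
$$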
So how do we actually estimate how much this term varies? We write out the variance. The variance of beta-hat is the variance of this sum, and we can drop the beta-one because it doesn't vary; it's a population quantity, it's fixed. In regression analysis, without going into details, we treat the x variables as fixed too, so the sum of squares total of x is fixed, or constant, in our equation. When we have the variance of a constant times something, say a constant times x, then that is the constant squared times the variance of x. If you remember the path analysis tracing rules, when you calculate the variance of something you always go to the source and come back, which means you take squares, and we take squares here as well. So we have the sum of squares total of x to the second power dividing the variance of the sum of (x minus x-bar), the difference of x from its mean, multiplied by the error term. We can further simplify this by moving the sum outside the variance: the variance of a sum of independent variables is the sum of their variances. That's the idea here. We can still make this equation a bit simpler, because (x minus x-bar) is a fixed value; it is fixed because x is fixed. We can move it outside the variance function, so we get (x minus its mean) to the second power multiplied by the variance of the error term for one particular observation.

And now, at this point, we have to make the homoskedasticity assumption. So far we haven't assumed that the variance of u_i is constant. If we assume that the variance of u_i is constant, that it doesn't vary as a function of x or anything else, then we can move the variance of u outside the sum. We then have the variance of the error term multiplied by this remaining sum, which is simply the sum of squares total, and since the denominator has the sum of squares total to the second power, that gives us the variance of the error term divided by the sum of squares total. This variance of the error term is estimated with the variance of the residuals. That's the normal, conventional standard error. You can find this derivation in your favorite econometrics book, if the book is any good, and it may explain a few more of the steps, such as why some terms were zero; I just stated that they were zero without explanation.

So what if we have heteroskedasticity? What if we can't make the homoskedasticity assumption that takes us from one line to the next? What do we do about it? Let's take a look. The idea here is that we can't move this variance of u_i outside the sum, because it's not constant. We can move it if it is constant, because a sum of different elements multiplied by the same constant is the same as that constant multiplying the sum of those elements. If the variance of u_i is different for each observation, we of course cannot move it out. So how do we deal with this problem? We deal with it by replacing the variance of u_i with the squared residual for that observation. The idea is that the variance is the mean of squared differences from the mean, and we know that the residuals have a mean of zero, and the error term has a mean of zero as well. So we can estimate the variance for each observation separately by using the squared residual.
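Written out, the derivation sketched above goes, step by step (a reconstruction of the slide from the narration):

$$
\operatorname{Var}(\hat{\beta}_1)
= \operatorname{Var}\!\left(\frac{\sum_i (x_i - \bar{x})\, u_i}{SST_x}\right)
= \frac{\sum_i (x_i - \bar{x})^2 \operatorname{Var}(u_i)}{SST_x^2}
= \frac{\sigma^2_u \sum_i (x_i - \bar{x})^2}{SST_x^2}
= \frac{\sigma^2_u}{SST_x}.
$$

The second equality uses the independence of observations (the variance of a sum equals the sum of the variances), and the third uses homoskedasticity. The robust version skips the homoskedasticity step and replaces $\operatorname{Var}(u_i)$ with $\hat{u}_i^2$.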
So we take this kind of equation, and that's our heteroskedasticity-consistent standard error. This heteroskedasticity-consistent standard error can also be used for regression with multiple predictor variables; in that case we use matrix equations, and the equation looks like this. These are called Eicker-Huber-White standard errors, or some combination of those names, after the statisticians who introduced these concepts to the literature. This is also called the sandwich estimator, because we have one X matrix here and another X matrix here, and then the meat, the squared residual multiplied by the products of the observations, is sandwiched between these two bread matrices. That's why it's called the sandwich estimator. You can see that this is the sum of squares total and this is the sum of squares total; with matrices, the order of multiplication matters, so we have one on the left side and one on the right side instead of squaring. The minus one denotes the matrix inverse, which is the matrix equivalent of dividing by something. Otherwise it looks the same: we take the squared residual, we have the products of the observations, and then we multiply by the inverses of the sum of squares total. The matrices are useful if you want to study this technique yourself, but as a typical researcher you don't really have to know how to read all of that.

So the question now is: if these standard errors don't assume homoskedasticity, they are more general, because they also work under heteroskedasticity, so when should we use them, and why not always use heteroskedasticity-consistent standard errors? The thing is that heteroskedasticity-consistent standard errors have been proven to work in large samples, and there is some evidence that their performance may not be very good when the sample size is small. In practice, if you have a large sample, several hundreds or thousands of observations, then using heteroskedasticity-robust SEs as standard practice is probably not a bad idea. If you work with small samples, for example experimental data with maybe 40 people in each experimental group, then you may be better off using the normal standard errors even if there is slight heteroskedasticity in your data. The reason these don't work as well in small samples is that the squared residual is not a good estimator of the variance of the error term for a particular observation in small samples; it gets better and better as the sample size increases. So heteroskedasticity-robust standard errors allow you to deal with heteroskedasticity, and the way you use them is that you simply turn them on.

Understanding cluster-robust standard errors, in turn, allows you to understand the effects of clustering. So let's take a look at the cluster-robust standard errors. The idea of heteroskedasticity-robust standard errors in matrix form was that you take the residual of one observation and you square it. In cluster-robust standard errors, you take two different residuals belonging to the same cluster and multiply them together, and you repeat that for every pair of observations in the same cluster. Why would you want to do that, what's the point, and what does analyzing this equation tell us about the effects of clustering on the normal regression model? Let's take a look at this particular part, and at why we have two different residuals.
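In standard notation, the matrix form described here is the familiar sandwich (again a reconstruction, in the basic HC0 form):

$$
\widehat{\operatorname{Var}}(\hat{\beta}) = (X'X)^{-1} \left( \sum_{i=1}^{n} \hat{u}_i^2 \, x_i x_i' \right) (X'X)^{-1},
$$

where $x_i$ is the vector of predictor values for observation $i$: the two $(X'X)^{-1}$ matrices are the bread, and the middle sum is the meat.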
So here we have one residual, and here we have two different residuals. Of course, the pair can also consist of the same residual twice; we basically multiply every pair of residuals within a cluster together. So let's go back to the derivation of the conventional standard errors. We made the homoskedasticity assumption here, and we actually had to make the independence-of-observations assumption a bit before that, here. So why do we need independence of observations here? The reason is that when we take the sum of these differences from the mean multiplied by the error term, the variance of the sum is the sum of the variances only if the observations are independent. If you take two variables, the variance of their sum is the sum of their variances only if the two variables are uncorrelated or independent.

So what do we do when that fails? We can't move the sum outside the variance, because of the non-independence of observations. What we actually do in the cluster-robust standard errors is calculate this variance as the sum of the variances plus the sum of all the covariances within the cluster. If we have ten observations in a cluster, then there are ten variances and 45 distinct covariances, and the variance of the sum is the sum of those ten variances plus two times the 45 covariances; equivalently, each covariance enters the sum twice. So we look at the covariances between the observations in the cluster. That covariance can come from multiple different sources. It can come from unobserved heterogeneity, so that some clusters are on average higher than others with no particular pattern. In panel data it can come from autocorrelation, so that observations close to each other in time are more similar to one another than observations that are far apart in time. What we do is take u_i and u_j, the error terms for two different observations, replace them with residuals, and reorganize a bit. The covariance between these two products, because the means are zero, is simply the expected value of the product of all these things multiplied together. So that's our cluster-robust standard error, and that's the equation in matrix form.

Looking at this equation, together with the variance equation for the regression coefficient with clustered data, allows us to learn something about the effects of clustering. So let's take a look at these two equations. The first is the variance formula if we know the covariance between two observations in the same cluster in the population: if we know that, and we know the sum of squares total, then we can calculate how much regression coefficients estimated from repeated samples from that population would vary from one sample to another. The second is the equation that we use to estimate the variance. So one is an estimate of the variance, a standard error, and the other is the actual variance if we know these population values. Let's take a look at the first part first. What does it tell us?
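In matrix form, the cluster-robust variance estimator described here is commonly written as follows (a reconstruction; this is the standard Liang-Zeger form, with $g$ indexing the clusters):

$$
\widehat{\operatorname{Var}}_{\text{cluster}}(\hat{\beta}) = (X'X)^{-1} \left( \sum_{g=1}^{G} X_g' \hat{u}_g \hat{u}_g' X_g \right) (X'X)^{-1},
$$

where $X_g$ and $\hat{u}_g$ collect the predictor rows and the residuals of cluster $g$; the outer product $\hat{u}_g \hat{u}_g'$ contains exactly the products of residual pairs within the cluster that the narration refers to.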
Let's use a little bit of covariance algebra and write it down. I'm going to use d for x minus the mean of x, so we just work with difference scores. The covariance of the difference score for i times u_i with the difference score for j times u_j is the expected value of the product of all these things, minus the product of the means of the two terms. Well, the means are simply zero, so that part is eliminated, and the covariance is simply the expected value of the product of all four things, the two error terms and the two deviations of x from the mean, multiplied together. Because the error terms are assumed to be uncorrelated with the predictor, the no-endogeneity assumption, this expectation can be separated: the expected value of the product of the two deviations from the mean of x, times the expected value of the product of the two error terms, which is the covariance between the two predictor values multiplied by the covariance between the two error terms.

This equation actually gives us some insights that are demonstrated in another video with a simulated data set. The thing here is that if two observations of x are independent, so if the ICC of x is zero, then the first term will be zero, and whatever the correlation between the two error terms is doesn't matter. So if your x values are independent of one another, no clustering effect, no autocorrelation, nothing in the x variables, then it turns out that non-independence of the error terms is actually not a problem for your analysis, which is kind of an interesting result. It's probably not very practical, but it explains why, in another video, we can get clustering effects by manipulating the ICC of one variable but not the other.

If we look at the actual equation that we use for calculating the standard error, we can look at this part here. Because we multiply two residuals together, and we do that separately for each pair within a cluster, and we do that for each cluster independently, this implies that the cluster-robust standard errors are valid regardless of how the error terms are correlated within a cluster. There can be strong autocorrelation in some clusters and no autocorrelation in other clusters, and these standard errors don't care, because we don't make any assumptions about any covariance; we estimate every within-cluster covariance by multiplying two residuals together. So this is robust to arbitrary within-cluster correlation. Most traditional techniques for panel data focus particularly on unobserved heterogeneity, and unobserved heterogeneity manifests as error terms that are correlated within a cluster, but that correlation is constant: in the normal, traditional panel data model there is no effect where two observations that are closer to one another in time are more similar than two observations that are farther from each other. That would require a model for autocorrelation. So cluster-robust standard errors, in contrast to, for example, GLS fixed effects and GLS random effects, also allow you to have correlation structures beyond the basic structure where each observation is correlated at the same level with every other observation. It's more robust, and that's one reason why, when you work with panel data, you should always consider using cluster-robust standard errors, even if you already applied an estimation technique that took unobserved heterogeneity into account.
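Written out, the covariance algebra at the start of this passage is (a reconstruction; as in the narration, the factorization leans on the error terms being unrelated to the predictor):

$$
\operatorname{Cov}(d_i u_i,\, d_j u_j) = E[d_i d_j u_i u_j] = E[d_i d_j]\, E[u_i u_j] = \operatorname{Cov}(x_i, x_j)\, \operatorname{Cov}(u_i, u_j),
$$

where $d_i = x_i - \bar{x}$. If $\operatorname{Cov}(x_i, x_j) = 0$, the whole product is zero no matter how strongly the error terms are correlated, which is the insight described above.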
The second point that we learn from here is that when we multiply two residuals together, that is a poor estimator of the actual covariance unless the sample size gets large. If the number of clusters is small, then the standard errors are typically too small, that is, biased. This is a problem that you can't solve by increasing the number of observations within clusters. The idea is that if you have, let's say, 30 companies that you follow for 10 years, so you have 300 observations, you could be concerned that your cluster-robust standard errors are slightly biased because of the small number of clusters; increasing the number of observations within each cluster from 10 to 20, to raise the total sample size to 600, wouldn't do anything for the potential bias. So whether this estimator is accurate or not depends on the number of clusters, and what is sufficient is difficult to say. Angrist, for example, has suggested that maybe 40 would be a minimum limit: if your number of clusters is below 40, you could be in trouble; if you have more than 40, you could be fine. But this of course depends on many different things, so we can't set one cutoff that would be useful in all scenarios; it just gives you a ballpark estimate of what kind of cluster counts are needed for this technique to be really useful.

So this video went through some of the math to show some insights about how clustering works and how it affects the variance of regression coefficients. The key takeaways are that these techniques are useful but require large sample sizes, and that if you want to apply them you don't actually have to understand much of the math here, because they are typically applied by just switching them on in software that supports them.
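To make the "switching them on" concrete, here is a minimal sketch in Python using statsmodels with simulated clustered data. The variable names (firm, x, y) are made up for the example, and HC1 is just one of several heteroskedasticity-consistent variants the library offers:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate panel-like data: 30 firms observed for 10 periods each
rng = np.random.default_rng(0)
n_firms, n_periods = 30, 10
n = n_firms * n_periods
firm = np.repeat(np.arange(n_firms), n_periods)

# Independent firm-level components in the predictor and in the error,
# so both x and the errors are clustered, but there is no endogeneity
a = rng.normal(0, 1, n_firms)[firm]
b = rng.normal(0, 1, n_firms)[firm]
x = rng.normal(0, 1, n) + a
u = rng.normal(0, 1, n) + b
y = 1.0 + 0.5 * x + u
df = pd.DataFrame({"y": y, "x": x, "firm": firm})

model = smf.ols("y ~ x", data=df)

print(model.fit().bse)                # conventional standard errors
print(model.fit(cov_type="HC1").bse)  # heteroskedasticity-robust (Eicker-Huber-White)
print(model.fit(cov_type="cluster",   # cluster-robust, clustering on firm
                cov_kwds={"groups": df["firm"]}).bse)
```

With only 30 clusters, this example sits right around the ballpark limit discussed above, so the cluster-robust standard errors themselves should be interpreted with some caution.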