Regression analysis assumes that the sample you're analyzing is a random sample from the population. That assumption could be violated if, for example, you have 100 observations, but those observations are measured from only five different people, each of whom is measured 20 times. What is the impact of non-independence of observations on regression analysis, and what kind of problems can it cause for empirical analysis? Let's take a look. Here are the six regression assumptions according to Wooldridge, and the second assumption is the independence of observations. So what happens if the observations are not independent? I will go through this with a couple of examples. Let's take a simple example where we are interested in estimating the mean of a population. Our sample is 100 observations, and these 100 observations come from five clusters. For example, we observe five companies over 20 years, or we measure reaction times from five people, each measured 20 times, and we want to know the population mean. If the intraclass correlation is zero, so there is no dependence between the observations within a cluster, we get a very precise estimate of 0.08 for the mean. The actual population mean here is zero and the population variance is one. What happens if we increase the intraclass correlation? We make the observations that are yellow, green, and purple closer to one another, so the data start to cluster. We can see that the yellow observations start to cluster here, the purple observations go here, and the green observations go somewhere in the middle.
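The clustered sampling setup described here is easy to simulate. A minimal sketch (numpy assumed, function name hypothetical) that keeps the total variance at one while shifting a share of it into cluster effects:

```python
import numpy as np

rng = np.random.default_rng(0)

def clustered_sample(icc, n_clusters=5, per_cluster=20):
    """Total variance stays at 1: the cluster effect carries `icc`
    of it, the within-cluster noise the remaining 1 - icc."""
    effects = rng.normal(0.0, np.sqrt(icc), n_clusters)
    noise = rng.normal(0.0, np.sqrt(1.0 - icc), (n_clusters, per_cluster))
    return (effects[:, None] + noise).ravel()

# The population mean is 0; the sample mean stays unbiased, but its
# spread across repeated samples grows as the clusters tighten.
for icc in (0.0, 0.5, 0.95):
    means = [clustered_sample(icc).mean() for _ in range(2000)]
    print(f"ICC={icc:.2f}  sd of the sample mean = {np.std(means):.3f}")
```

Each draw still has 100 observations and unit variance; only the dependence structure changes, which is exactly what drives the loss of precision discussed next.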
When we increase the intraclass correlation of these data, while maintaining the total variance, we can see that the sample mean becomes a less and less accurate estimator of the population mean. Originally, when we had 100 independent observations, our estimate was 0.08; after we have strongly clustered the data, it is 0.61. When the intraclass correlation is one, we have a special case where there is no within-cluster variance. We have 100 observations but only five unique values. And if we only have five unique values, it makes no difference whether we have each of those five values 1,000 times or just once, because the repeats give us no new information about the population mean: after we have the first observation from a cluster, the remaining observations bring nothing new into the analysis. The idea here is that when our data are independent, each observation brings the same amount of new information to the analysis. When the observations are dependent, that is, when there is intraclass correlation, the first observation from a cluster brings a lot of new, unique information, but subsequent observations from the same cluster tell you less and less about where the population mean is. For example, if we want to measure the average height of people at a university, and our measuring tape contains some measurement error, then it is better to measure 100 different people than to measure the same 10 people 10 times. And if there is no measurement error at all, measuring the same 10 or five people over and over will not improve the precision of your estimate at all. So the problem is that when intraclass correlation increases, when observations lack independence, our estimates will be less precise.
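The "worth about 20 observations" arithmetic mentioned below corresponds to the standard design-effect formula for equal-sized clusters (the function name here is hypothetical):

```python
def effective_sample_size(n, cluster_size, icc):
    """Kish design effect for equal-sized clusters:
    deff = 1 + (m - 1) * icc, and n_eff = n / deff."""
    deff = 1 + (cluster_size - 1) * icc
    return n / deff

# 100 observations in five clusters of 20:
print(effective_sample_size(100, 20, 0.0))  # -> 100.0 (independent data)
print(effective_sample_size(100, 20, 0.2))  # -> ~20.8 ("about 20 observations")
print(effective_sample_size(100, 20, 1.0))  # -> 5.0 (only five unique values)
```

The two extremes reproduce the cases in the example: full independence gives 100 observations' worth of information, perfect within-cluster correlation gives only five.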
They are still consistent and they are still unbiased, but they are less precise. Okay, that was one variable; what if we have two variables and we want to run a regression analysis? We have x and y, and we still have 100 observations nested in five clusters, so 20 observations per cluster. Initially the intraclass correlation is zero, so all the observations are independent, there is no particular pattern in the colors, and our regression estimates are quite precise. The actual intercept is zero and our estimate is 0.1; the actual slope is 1 and our estimate is 1.07, so it's pretty close. That's what you can expect from 100 observations with one explanatory variable in a regression analysis. When we increase the intraclass correlation of both of these variables, we again see clustering: the yellow observations go here, the purple observations go here, the green observations go here, and ultimately, when the intraclass correlation reaches 1, we are in a scenario where we have just five observations that are repeated. And again, repeating the same observations gives us no new information for the estimation problem. The outcome is that when both of these variables have clustering effects, the regression coefficients, both the intercept and the slope, become less and less precise. They are still consistent and still unbiased, but the effect is the same as in the case where we estimated the mean from clustered data. In effect, intraclass correlation decreases our effective sample size. If we have 100 observations that are strongly clustered, it's possible that we actually have only five observations' worth of information. In less extreme cases, we could have 100 observations that carry information worth only about 20 observations, and so on.
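The two-variable case can be sketched the same way (numpy assumed, names hypothetical): both x and the error term share the same cluster structure, and the spread of the OLS slope across repeated samples grows with the intraclass correlation even though each estimate remains unbiased.

```python
import numpy as np

rng = np.random.default_rng(1)

def clustered(icc, n_clusters=5, m=20):
    """Unit-variance draws where cluster effects carry `icc` of the variance."""
    c = rng.normal(0.0, np.sqrt(icc), n_clusters)
    e = rng.normal(0.0, np.sqrt(1.0 - icc), (n_clusters, m))
    return (c[:, None] + e).ravel()

def slope_sd(icc, reps=2000):
    """Spread of the fitted slope; true intercept 0, true slope 1."""
    slopes = []
    for _ in range(reps):
        x = clustered(icc)
        y = x + clustered(icc)          # the error term is clustered too
        slopes.append(np.polyfit(x, y, 1)[0])
    return np.std(slopes)

print(slope_sd(0.0), slope_sd(0.9))  # the second is noticeably larger
```

The simulation changes nothing about the error variance or the predictor variance, only the dependence, which is why the loss of precision here is purely an effective-sample-size effect.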
Things get more interesting if only x is clustered, or only the error term is clustered, but not the other. Let's look first at what happens when x is clustered but the error terms are independent. We can see that as the intraclass correlation increases, the x values become more and more clustered until we have just five values. In this case, when x is clustered but the error term is not, the clustering actually has no effect. The regression intercept and slope will be slightly different as the clustering changes, but that's just because estimating the same quantity from different samples gives different results; there is no systematic effect of the estimates getting worse as the intraclass correlation increases. The reason is that regression analysis actually makes no assumptions about the independent variable: everything is estimated conditionally on the observed x values. A researcher could even set those x values, for example in an experimental context where we assign people to the treatment group and the control group. Then the x values are not random variables; they are something that we set as researchers. We could set them however we want and the regression analysis would not be affected. What if x is not clustered but the error term is clustered? This would be quite an unusual case, but it's nevertheless useful to understand what happens. When we cluster the error term, we effectively reduce the number of unique values in the error term, and that has one implication: the intercept is going to be estimated less precisely, but the slope estimate is going to stay about the same.
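The "estimated conditionally on x" point can be illustrated with a fixed design, say 20 subjects assigned to each of five conditions (a sketch, numpy assumed): no matter how clustered x is, with independent errors the classical formula for the slope's sampling spread still applies.

```python
import numpy as np

rng = np.random.default_rng(2)

# A fixed, maximally "clustered" design: x takes only five distinct
# values, as if set by the researcher rather than sampled.
x = np.repeat([-2.0, -1.0, 0.0, 1.0, 2.0], 20)

slopes = []
for _ in range(5000):
    y = 1.0 * x + rng.normal(0.0, 1.0, x.size)  # independent errors, true slope 1
    slopes.append(np.polyfit(x, y, 1)[0])

# Conditional on x, sd(slope) = sigma / sqrt(sum((x - mean(x))^2)),
# regardless of how the x values were generated.
theory = 1.0 / np.sqrt(((x - x.mean()) ** 2).sum())
print(np.std(slopes), theory)
```

The simulated spread matches the conditional-on-x formula, which is why clustering in x alone causes no systematic loss of precision.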
One way to understand why is that these error term values, even if we have just one value per cluster, still give us very useful information about the direction of the line, but not about how high the line is. When the errors are exactly the same within each cluster, that is, the intraclass correlation is one, each cluster forms an exact line that is parallel to the population regression line, but the intercept is estimated less efficiently. This would of course be a very unusual scenario. Typically, if you cannot assume that your error term, the unobserved sources of variation in the dependent variable, is independent across observations, then your explanatory variables typically cannot be assumed to be independent either. So we either have the case where the error term is independent, which would be the case under random sampling, but x could be non-independent, for example due to manipulation, or we have the scenario where both variables correlate within clusters. So why would non-independence of observations be a problem, and what is it a problem for? As we saw, non-independence doesn't lead to bias and it doesn't lead to inconsistency, but it leads to less precise estimates. And that is something we simply can't do anything about: if we don't have much information, we can't estimate things precisely. But that's not really a problem per se, because we can just state that we have an estimate that is not very precise, and sometimes we have to live with that. The real problem is with the standard error formula, which is derived from the variance formula by plugging in the estimated error variance for sigma over the sum of squares of x: this equation depends only on the variance of the error term.
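The variance formula being described here is presumably the classical OLS slope variance and the standard error obtained from it:

```latex
\operatorname{Var}(\hat{\beta}_1) = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2},
\qquad
\widehat{\operatorname{se}}(\hat{\beta}_1) = \frac{\hat{\sigma}}{\sqrt{\mathrm{SST}_x}},
\quad
\mathrm{SST}_x = \sum_{i=1}^{n}(x_i - \bar{x})^2 .
```

Nothing in this expression records how the n observations are grouped into clusters; it sees only the error variance, the spread of x, and the sample size.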
It depends on the variance of the predictor variable and on the sample size. But if there is a clustering effect in the data, we saw that the estimates will be less precise even when the error variance, the predictor variance, and the sample size are all the same, and this equation does not take the clustering into account. So whether we have five observations each replicated 20 times, so that the nominal sample size is 100 but the effective sample size is only five, or we actually have 100 unique observations, this formula gives us the same result. The outcome is that when you have clustering, the standard errors are generally estimated inconsistently and they will be negatively biased. You will overstate the precision of the estimates, and that will cause incorrect inference; in particular, it can lead to false positive findings, rejecting the null hypothesis when in fact it should not be rejected. So what can we do about this problem? There are a couple of strategies. One is to use a model that explicitly includes terms modeling the non-independence of the error term, which can be quite difficult if the pattern of dependency between observations is complex. Another approach is to use cluster-robust standard errors, which allow you to take an arbitrary correlation structure between observations into account. That is a very general strategy, and I will explain it in another video.
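Cluster-robust standard errors are covered in another video, but the core idea can be sketched with numpy (a plain sandwich estimator without small-sample corrections; all names hypothetical): sum the score contributions within each cluster before forming the middle of the sandwich, so arbitrary within-cluster correlation is absorbed.

```python
import numpy as np

rng = np.random.default_rng(3)

g = np.repeat(np.arange(5), 20)                      # cluster ids, 5 x 20
x = rng.normal(size=5)[g] * 0.7 + rng.normal(size=100) * 0.7
y = 1 + 2 * x + rng.normal(size=5)[g] * 0.7 + rng.normal(size=100) * 0.7

X = np.column_stack([np.ones(100), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
u = y - X @ beta                                     # residuals
XtX_inv = np.linalg.inv(X.T @ X)

# Classical SEs: assume independent, homoskedastic errors.
sigma2 = u @ u / (100 - 2)
se_classic = np.sqrt(np.diag(sigma2 * XtX_inv))

# Cluster-robust (sandwich) SEs: aggregate scores per cluster.
meat = np.zeros((2, 2))
for j in np.unique(g):
    s = X[g == j].T @ u[g == j]                      # cluster score vector
    meat += np.outer(s, s)
se_cluster = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print(se_classic, se_cluster)
```

In practice one would use a library implementation, which also applies a small-sample correction (e.g. a factor involving the number of clusters); the loop above only shows the structure of the estimator.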