Autocorrelation refers to the correlation of a variable with itself over time. It is very commonly present in longitudinal datasets, and it can be confused with other features, such as time trends or unobserved heterogeneity, that may be present in the same dataset. Let's take a look at what autocorrelation is, how it differs from time trends and unobserved heterogeneity, how it should be dealt with, and what the implications of having autocorrelation in your dataset are.

Let's start by looking at plain random noise. Normally, if we run a pooled OLS regression, so we take a longitudinal dataset, put it all together, run an OLS regression, and ignore any time dimension and any clustering, what kind of assumptions are required? We need four assumptions about the error term: independence, no endogeneity, homoscedasticity, and normal distribution. Those assumptions hold here: this is just random noise from a normal distribution, and this is what regression analysis assumes if we apply a normal regression analysis to a longitudinal dataset.

What does independence mean here? It means that the value we observe at time one does not depend on any past or future values. What we have observed in the past, or what we will observe in the future, does not in any way influence a particular observation. This is simply random noise: it goes up and down, and it could stay up for a while before coming down, but long runs are improbable because each value is just a random draw. It's like throwing a fair die many, many times: if you have rolled three sixes in a row, that does not make a fourth six any more likely, because the throws are random. So the data are independent and identically distributed. Identical distribution means that the variance here is the same as the variance over there: neither the variance nor the shape of the distribution changes over time. Everything is random and everything is pretty much identical, so we could shuffle the observations around and the statistical nature of the data would stay the same. The time ordering has no implications for the data. That is what independent and identically distributed means.

And this is autocorrelated data. Autocorrelation means that the value of a variable depends, strongly or weakly, on its past values. Stock indices are a good example. We can see that the stock index is not like the random noise from before. There is of course a time trend, which is different from autocorrelation, but we can also see movements that go up and down: there's a peak here and a slump here. When there is one bad day, it tends to be followed by other bad days; it's not that you get one bad day and then the index bounces right back up. The same goes for an upturn in the economy: stock prices tend to stay high rather than randomly jumping up and down. So this basically follows what we could call a random walk: we just add something to the previous value. It's not actually a proper random walk, because it doesn't diverge, but basically the stock goes up or down by some percentage from its previous value. So this is autocorrelation: the current value depends on the past value.
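To make the contrast concrete, here is a minimal simulation of independent noise versus a first-order autocorrelated (AR(1)) series. This is an illustrative sketch: the seed, the persistence parameter, and the helper function are my own choices, not the series shown on the slides.

```python
import numpy as np

rng = np.random.default_rng(42)
T = 60

# Independent, identically distributed noise: each value is a fresh draw,
# so shuffling the observations would not change the statistical properties.
iid = rng.normal(0, 1, T)

# AR(1) series: each value is rho times the previous value plus fresh noise.
# Scaling the noise by sqrt(1 - rho^2) keeps the stationary variance at one.
rho = 0.8
ar1 = np.empty(T)
ar1[0] = rng.normal(0, 1)
for t in range(1, T):
    ar1[t] = rho * ar1[t - 1] + rng.normal(0, np.sqrt(1 - rho**2))

def lag1_autocorr(x):
    """Sample correlation between the series and itself shifted by one."""
    return np.corrcoef(x[:-1], x[1:])[0, 1]

print(lag1_autocorr(iid))  # close to zero: past values carry no information
print(lag1_autocorr(ar1))  # close to rho: bad days tend to follow bad days
```

So why is autocorrelation problematic?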
Or is it a problem at all? Well, it is at least a feature of the data. To understand autocorrelation we need to understand a bit more about its nature, and in particular that autocorrelation has an order. The typical cases we consider are first-order and second-order autocorrelation. First-order autocorrelation simply means that the value of y2 depends on y1 and, controlling for y1, does not depend on any past or future values of y. In the second-order case the value of y2 depends on y1 and y0 and, controlling for y0 and y1, does not depend on anything else. In the first-order case, y3 and y0 are still correlated, but only because the effects of y0 are mediated by y1 and y2.

The AR(1) structure is the simplest possible way of modeling autocorrelation, and it is also the most common, both because of its simplicity and because quite often there is no good theoretical reason to believe the structure is anything else. Beyond AR(1), we can compare AR(1) against AR(2) to check whether there is an effect that jumps over one year. With AR(1), if these are annual data, effects go from one year to the next, but controlling for last year's value, the value from the year before would have no effect: effects only travel from one observation to the next, they don't jump over observations. Second-order autocorrelation, on the other hand, means that the value depends on the past two observations. We can have higher orders as well, but in practice people often default to AR(1). If we want to be more rigorous, we fit both AR(1) and AR(2), and if the coefficient for the two-year lag turns out to be non-significant, we go with AR(1). So we can do comparisons and check what the simplest model that is adequate for the data is.

Autocorrelation can be confused with time trends or other kinds of persistence over time in the data. In neither of the following cases do we have autocorrelation. The first case is simply a blue line and a red line that are consistently different from one another. That is not autocorrelation: the fact that the blue line is always higher than the red line simply means that there is unobserved heterogeneity. The values of the blue line do not depend on the previous values of the blue line once we control for the blue line's mean level, and the same goes for the red line. Autocorrelation is, in a way, about movement over time: it is not about persistent differences, it is about something going up and down and depending on its past values. If there is a constant difference like this, we would not call it autocorrelation. Of course, if we calculate an autocorrelation statistic from data with unobserved heterogeneity, the statistic will indicate that there is autocorrelation, but that basically means the model is misspecified because we did not include the unobserved heterogeneity in the model.

The second case that is commonly confused with autocorrelation is a time trend. Here we have a linear trend: the values always go up, and the observed values of y do not depend on the previous value, they depend simply on the time index. This could also be mistaken for autocorrelation. Empirically, if we have this kind of dataset and estimate autocorrelation without modeling the trend, we are misspecifying the model and the estimates are going to be biased: our model will indicate that there is autocorrelation when in fact there is none, it is simply a linear trend.
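A small simulation illustrates this pitfall. Below, a series that is nothing but a linear trend plus independent noise looks strongly autocorrelated in its raw form, while the residuals from a fitted trend show essentially no autocorrelation. The slope and noise level are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 100
t = np.arange(T)
y = 0.05 * t + rng.normal(0, 1, T)  # linear time trend, independent errors

def lag1_autocorr(x):
    x = x - x.mean()
    return np.corrcoef(x[:-1], x[1:])[0, 1]

# Ignoring the trend, the raw series appears clearly autocorrelated.
print(lag1_autocorr(y))        # noticeably positive

# After removing the time trend, the residuals look like independent noise.
slope, intercept = np.polyfit(t, y, 1)
resid = y - (intercept + slope * t)
print(lag1_autocorr(resid))    # close to zero
```

The lesson is that autocorrelation diagnostics should be run on a correctly specified model, not on raw series that still contain trends or unobserved heterogeneity.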
We need to understand these differences. A linear trend means that the observations tend to move in the same direction, not because they depend on one another, but because they depend on time. Autocorrelation means that the dependency is on the previous value, not on a time trend. And unobserved heterogeneity means that there are unobserved causes of differences that are shared by all of a unit's observations; it is not that the observations depend on their past values. These need to be distinguished empirically and also conceptually.

Autocorrelation and variation over time are related, and if we look at longer time series we can see something interesting. Let's look at the variance first. The first dataset has fairly strong autocorrelation, and the second dataset has no autocorrelation: it is normally distributed noise. Interestingly, the variance of both of these artificially generated time series is exactly one. Now, if we split the first time series of 60 observations into 10 periods of 6 years each and calculate the variance within each period, we can see that the within-period variances are much less than one, whereas in the random, no-autocorrelation data with independent observations, the variances of these shorter periods are one on average. So if we do not model the effects of autocorrelation, particularly in short time series, we run the risk of underestimating variance components. If we only look at 6 years of this dataset, we are going to estimate that the variance is about 0.5, which is the mean of the within-period variances. In other words, if we ignore autocorrelation and have a short time series, we underestimate how much the observations actually vary. That does not happen in the independent-observations case. So that is one implication of autocorrelation.

Another interesting feature we can see from these graphics is that in the independent case, the value of one observation does not depend on any other; that is the definition of independence. But look at the autocorrelated dataset and consider an observation far away from the starting point. The starting point is very low, which means the first 6 years are very low; the series starts to go up but stays below the mean for roughly the first 10 observations. By the time we reach observation number 60, that observation depends so weakly on the first observation, because the series has had time to go up and down many, many times, that in practice observation number 60 is independent of observation number one. So autocorrelation means that observations close to one another in time are not independent, but if there is a long enough time period between two observations, those observations are in practice independent. That can be leveraged in some analysis techniques.
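The chunked-variance point is easy to verify by simulation. The sketch below generates a 60-observation AR(1) series with stationary variance one, splits it into 10 periods of 6 observations, and compares the average within-period variance against the same calculation for independent noise. The persistence parameter is an assumption chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
T, rho = 60, 0.9

# AR(1) series whose long-run (stationary) variance is one.
ar1 = np.empty(T)
ar1[0] = rng.normal(0, 1)
for t in range(1, T):
    ar1[t] = rho * ar1[t - 1] + rng.normal(0, np.sqrt(1 - rho**2))

# Independent noise with variance one, for comparison.
iid = rng.normal(0, 1, T)

def mean_within_period_variance(x, period=6):
    periods = x.reshape(-1, period)             # 10 periods of 6 observations
    return periods.var(axis=1, ddof=1).mean()   # average within-period variance

print(mean_within_period_variance(ar1))  # well below one
print(mean_within_period_variance(iid))  # about one on average
```

Because adjacent values of the AR(1) series move together, any short window sees only a slice of the series' full range, which is exactly why short panels understate variance components when autocorrelation is ignored.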
Some practical advice on autocorrelation. Autocorrelation of the error term can lead to biased variance estimates: your regression estimates will be okay, but your standard errors will be too small. Autocorrelation can also be confused, conceptually and empirically, with unobserved heterogeneity and time trends, so it is important to first understand conceptually how these three things differ when you deal with longitudinal data.

Then, assumptions about autocorrelation, or no autocorrelation, in our models are always about the unobserved parts. If you have an error term, the assumptions about autocorrelation are typically about that error term and that error term only. The fact that your explanatory variables autocorrelate does not have any implications for your analysis, because in a regression analysis the explanatory variables are fixed: we take whatever values we have and use them, and we make no assumptions about the distributions or independence of those values. We only make assumptions about the error term. However, calculating autocorrelations from observed variables can be useful, particularly for the dependent variable. If we can show that none of our explanatory variables have autocorrelation and the dependent variable does not have autocorrelation either, then we can pretty much conclude that the error term probably does not have autocorrelation. On the other hand, if all the variables we observe are strongly autocorrelated, it would be unreasonable to assume that the error term is not autocorrelated. Generally, if the variables we observe are autocorrelated, then the variables we do not observe, which go into the error term, tend to be autocorrelated as well.

Autocorrelation can be dealt with in a couple of different ways, and we need to understand what its impact is. If we can assume that the random part, the error term and any other random effects, is uncorrelated with the fixed part, then autocorrelation only affects the standard errors, not the consistency of the estimates, so we can use cluster-robust standard errors to deal with the issue (see the sketch below). Of course, if we do diagnostics for autocorrelation, the simplest possible solution is to show that the existence of autocorrelation in our model or data is unlikely; that is the first strategy you usually go for. Finally, if our model is more complex, we may want to add autocorrelation as a component of the model, for example through error structures in multilevel models, or in panel data models with latent variable modeling, like a cross-lagged model, where you could add autocorrelations. That can run into identification issues, but it is possible.

In practice, dynamic panels lead to tricky endogeneity issues. An autocorrelated error term really becomes a problem if you have the y from the previous time point as a predictor of the current time point. That is the biggest concern you may have. In other scenarios autocorrelation is fairly easy to deal with, because it does not lead to endogeneity: you can just switch to cluster-robust standard errors and you are going to be fine if you have a large sample size. But if you have a dynamic panel, autocorrelation leads to inconsistency unless it is properly modeled.

In practice, autocorrelation is often considered in multilevel modeling guidelines. This is step four, I believe the final step, of Bliese and Ployhart's modeling approach. They recommend that, as a final step, once you have constructed your model, you test alternative error structures to see whether the errors are autocorrelated. If they are, then apply autocorrelated errors to your model; if they are not, then you should not include autocorrelation of the error term.
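Here is a minimal sketch of the cluster-robust remedy mentioned above, assuming statsmodels is available. The panel is simulated, 50 units observed for 10 periods with AR(1) structure within each unit, so all variable names and parameter values are illustrative assumptions.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n_units, n_periods, rho = 50, 10, 0.7
unit = np.repeat(np.arange(n_units), n_periods)  # cluster identifier

def ar1_panel(rng, n_units, n_periods, rho):
    """Stack one AR(1) series (stationary variance one) per unit."""
    out = np.empty(n_units * n_periods)
    for i in range(n_units):
        u = np.empty(n_periods)
        u[0] = rng.normal(0, 1)
        for t in range(1, n_periods):
            u[t] = rho * u[t - 1] + rng.normal(0, np.sqrt(1 - rho**2))
        out[i * n_periods:(i + 1) * n_periods] = u
    return out

# Both the regressor and the error autocorrelate within units, the situation
# where naive OLS standard errors are most clearly too small.
x = ar1_panel(rng, n_units, n_periods, rho)
e = ar1_panel(rng, n_units, n_periods, rho)
y = 1.0 + 0.5 * x + e

X = sm.add_constant(x)
naive = sm.OLS(y, X).fit()  # pooled OLS, pretends errors are independent
clustered = sm.OLS(y, X).fit(cov_type="cluster", cov_kwds={"groups": unit})

print(naive.bse)      # standard errors that ignore within-unit dependence
print(clustered.bse)  # cluster-robust standard errors, valid with many units
```

The point estimates are identical in the two fits; only the standard errors differ, which matches the statement above that autocorrelation here affects inference rather than the consistency of the estimates.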
And finally, here is an example of diagnostics: Deephouse applies some simple techniques. If you are more interested in these techniques, you can go and check Greene's book or Kennedy's book; Deephouse is kind enough to give the page numbers where you can read about the tests. Based on these tests, Deephouse concludes that autocorrelation of the error term is unlikely to be present in the data, and therefore it does not really need to be taken into consideration: you can just go and do a normal regression analysis. So this is the simplest way of dealing with autocorrelation, but unfortunately autocorrelation is actually quite common in longitudinal datasets.
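To give a flavor of what such simple diagnostics can look like in code, here is a hedged sketch of two standard residual autocorrelation tests available in statsmodels. These are generic examples of this kind of test, not necessarily the exact procedures Deephouse reports, and the simulated data are purely illustrative.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import acorr_breusch_godfrey

rng = np.random.default_rng(5)
T = 120
x = rng.normal(0, 1, T)
y = 1.0 + 0.5 * x + rng.normal(0, 1, T)  # independent errors by construction

fit = sm.OLS(y, sm.add_constant(x)).fit()

# Durbin-Watson statistic: values near 2 suggest no first-order
# autocorrelation in the residuals; values near 0 suggest strong
# positive autocorrelation.
print(durbin_watson(fit.resid))

# Breusch-Godfrey test: checks residual autocorrelation up to a chosen lag
# order, here 2, mirroring the AR(1)-versus-AR(2) comparison discussed earlier.
lm_stat, lm_pval, f_stat, f_pval = acorr_breusch_godfrey(fit, nlags=2)
print(lm_pval)  # a large p-value means autocorrelation is unlikely
```

If diagnostics like these come back clean, as they do for Deephouse, the simple analysis is defensible; if not, the remedies discussed above, cluster-robust standard errors or explicitly modeled error structures, are the way to go.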