Multilevel modelling software typically supports different kinds of error covariance structures. These error covariance structures are also quite commonly discussed in the various guidelines that you find in books and articles. So what is an error covariance structure? To understand what an error covariance structure means and what impact it has on the analysis, we first need to understand maximum likelihood estimation of a normal regression model.

In my course I have the students complete this kind of Excel sheet and compute maximum likelihood estimates using Excel's optimizer. The estimates are here: we have the error variance here, and then we have the regression coefficients, the betas, and here is our data. We are using education, women, and prestige as explanatory variables, income as the dependent variable, and the fitted value is obtained by multiplying the values of education, women, and prestige with the corresponding coefficients and adding the intercept.

So how do we calculate the likelihood, and how do we calculate the log-likelihood that is maximized? The idea of a likelihood calculation for an individual observation is that we first assume that the error term is normally distributed in the population. Then we basically take a normal distribution for the first observation: we plot a normal distribution centered at the fitted value, we take the dispersion, how wide the normal distribution is, from the estimated error variance, and then we check where our observation falls. Government administrators earn about 12,300 Canadian dollars; they are expected to earn about 11,000. So how likely would this observation be? We go up to the curve, we read the value 0.0014, and that is our likelihood. Then we take the log, because logs are easier for a computer to work with than raw likelihoods.

We proceed this way through all the observations. For general managers, we plot the distribution again. The fitted mean is 11,487, but general managers earn more than 25,000, so they earn a lot more than they should based on the model. So this observation has a very small likelihood: in the log metric it is about minus 25, and in the raw metric it rounds to zero on our computers. Then we multiply these likelihoods together, or in practice we take the sum of the log-likelihoods, and that gives us the full likelihood of the model. That is the log-likelihood that our computer reports.

So this is maximum likelihood estimation of a normal regression model. The way the actual estimation works is that we have some starting values for the coefficients, and then we tell the optimizer to find the regression coefficients that make this log-likelihood as large as possible. In practice it is always negative, but it gets closer and closer to zero as the model fit improves.

This calculation critically relies on the independence of observations assumption. It works because these are independent probabilities, independent likelihoods: the value for government administrators does not depend on the value for general managers. The independence of observations assumption is what allows us to multiply these likelihoods, or sum these log-likelihoods, together. However, in random intercept models we do not have independence of observations. There is an unobserved effect in the random part that is shared by, for example, government administrators and general managers, so in the random part, the variation around the regression line would no longer be independent.
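To make this concrete, here is a minimal sketch in Python of the same calculation the Excel sheet does, with scipy's optimizer playing the role of Excel's Solver. The data here is a hypothetical stand-in, not the actual Canadian prestige dataset, and the parameterization with a log variance is just one convenient choice.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

# Hypothetical stand-in for the prestige-style data in the lecture:
# three explanatory variables (education, women, prestige), y is income.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = 10_000 + X @ np.array([800.0, -50.0, 300.0]) + rng.normal(0, 2_000, 100)

def negative_log_likelihood(params):
    # params: intercept, three slopes, and the log of the error variance
    # (the log keeps the variance positive during optimization).
    intercept, b1, b2, b3, log_var = params
    fitted = intercept + X @ np.array([b1, b2, b3])
    sigma = np.sqrt(np.exp(log_var))
    # Independence assumption: the joint log-likelihood is simply the sum
    # of univariate normal log-densities, one per observation.
    return -np.sum(norm.logpdf(y, loc=fitted, scale=sigma))

# Starting values for the coefficients, then let the optimizer make the
# log-likelihood as large as possible (i.e., minimize its negative).
start = np.array([0.0, 0.0, 0.0, 0.0, np.log(np.var(y))])
result = minimize(negative_log_likelihood, start, method="BFGS")
print("ML estimates:", result.x[:4], "error variance:", np.exp(result.x[4]))
print("log-likelihood:", -result.fun)
```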
So how do we deal with that problem? The way to deal with it is to use the multivariate normal distribution. This is a bivariate normal distribution. The idea is that if we have two variables that follow a multivariate normal distribution, both of them are normally distributed, but they are correlated, so that if we take a random draw from this population and we receive a high value of y, then it is more likely that we also receive a high value of x. If a value of four was observed for y, then the expected value of x would be two. So they are correlated: the probability of one depends on the observed value of the other.

How do we then use this principle in maximum likelihood estimation? Instead of looking at each observation one at a time, what we do conceptually is estimate an error covariance matrix for all the error terms. The idea is that the error terms are no longer independent. On the diagonal is the variance of the error term, and the off-diagonal elements are the covariances within clusters. Here we have four clusters with three observations each, and the observations are allowed to be correlated within clusters, so they covary. The magnitude of these covariances is simply the random effect variance.

The way we estimate is that instead of looking at the univariate normal distribution and the likelihood from that distribution, we tell the computer that we have a 12-variable multivariate normal distribution characterized by this covariance matrix and these means, the fitted values, and we ask for the likelihood of the observations given the estimated covariance matrix and mean vector; then the computer gives us the result. So the computer can estimate the probability. It will know that the probability of getting minus four and plus four is a lot smaller than getting plus four and plus four if those variables are positively correlated. It can do that kind of calculation for us.

In practice this would be very difficult to calculate if we had a large number of observations. So we simplify: we know that the observations are independent between clusters, the error terms do not correlate between clusters, so we can calculate the likelihood one cluster at a time. We compare the three values of y in a cluster against a multivariate normal distribution with three variables, three means, and this covariance matrix. That gives us the likelihood of that cluster; we then get the likelihood of every cluster, we sum those together, and we get the full model likelihood.

So how do these error covariance structures relate to this estimation approach? Normally, when we have a random intercept, we get a constant covariance between observations within a cluster. That is the effect of the random intercept. This is the random intercept model: the diagonal is the error variance plus the random effect variance, because the random intercept affects all observations of the same cluster. But that is not the only thing we can do. Another commonly used error structure is autoregressive errors. The idea of autoregressive errors is that the error term of one observation depends on the previous time point, and that one depends on the time point before it. So we can see that observations at time one and time two are highly correlated, while time one and time three are less correlated, because we raise the correlation to the second power.
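As a sketch of this cluster-by-cluster calculation, here is Python code for a random intercept structure. The variance values and the data are hypothetical; the point is the shape of the within-cluster covariance matrix and the sum of per-cluster log-likelihoods.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical values: error variance, random intercept variance,
# and four clusters of three observations each (12 observations total).
sigma2_e, sigma2_u = 4.0, 1.5
n_per_cluster, n_clusters = 3, 4

# Random intercept covariance structure for one cluster:
# sigma2_e + sigma2_u on the diagonal, sigma2_u off the diagonal.
cluster_cov = sigma2_u * np.ones((n_per_cluster, n_per_cluster)) \
    + sigma2_e * np.eye(n_per_cluster)

# Hypothetical fitted values and observed values, one row per cluster.
rng = np.random.default_rng(1)
fitted = np.arange(12, dtype=float).reshape(n_clusters, n_per_cluster)
y = fitted + rng.normal(0, 2, size=fitted.shape)

# Clusters are independent, so the full log-likelihood is the sum of
# one multivariate normal log-density per cluster.
log_lik = sum(
    multivariate_normal.logpdf(y[g], mean=fitted[g], cov=cluster_cov)
    for g in range(n_clusters)
)
print("model log-likelihood:", log_lik)
```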
And now again the between-cluster correlations are zero. We can do all kinds of combinations with this. Typically you always have the random intercept, the RE here, but you can have the random intercept and the autoregression together, as long as you have enough observations. These kinds of error covariance structures are pretty commonly used in empirical research. This is from Hausknecht's paper, which applies the Bliese and Ployhart modeling workflow, and that workflow concludes with an assessment, or comparison, of different error structures. What you do, basically, is consider which error structures are theoretically justified. Typically, autocorrelation is something you should always consider if you have longitudinal data, so at least that should be compared. Then maybe you have heteroscedasticity, maybe you have some other unusual structure in the correlations. So you consider which of those correlation structures are appropriate for your data, then you fit all of them, and then you basically pick the one that fits best based on the likelihood ratio test. The unstructured covariance basically means that all the errors are freely correlated; there is no particular pattern. Typically that would be a fallback option if we cannot use a theoretically motivated pattern such as AR1 autocorrelation.
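As an illustrative sketch, assuming an error variance sigma2_e, an AR1 correlation rho, and an optional random intercept variance sigma2_u (all hypothetical values here), the within-cluster covariance matrices for these structures could be built like this:

```python
import numpy as np

def ar1_cov(n_times, sigma2_e, rho):
    """AR(1) errors: the correlation between time i and time j is
    rho ** |i - j|, so adjacent time points are most strongly correlated."""
    lags = np.abs(np.subtract.outer(np.arange(n_times), np.arange(n_times)))
    return sigma2_e * rho ** lags

def ri_plus_ar1_cov(n_times, sigma2_e, rho, sigma2_u):
    """Random intercept combined with AR(1): a constant covariance
    sigma2_u is added on top of the autoregressive structure."""
    return sigma2_u * np.ones((n_times, n_times)) + ar1_cov(n_times, sigma2_e, rho)

# Three time points: time 1 vs 2 covariance is 4 * 0.5 = 2,
# time 1 vs 3 is 4 * 0.5**2 = 1, since the correlation is squared.
print(ar1_cov(3, sigma2_e=4.0, rho=0.5))
print(ri_plus_ar1_cov(3, sigma2_e=4.0, rho=0.5, sigma2_u=1.5))
```

Either matrix could then be plugged into the cluster-by-cluster likelihood calculation sketched earlier; comparing the resulting log-likelihoods across structures is what the likelihood ratio test formalizes.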