The data augmentation algorithm is the most commonly used technique for doing the actual imputations in a multiple imputation analysis. You don't need to understand the technical details of this algorithm or the statistics on which it is based, but understanding the basic principle of what the algorithm does, and why we use it, is useful: sometimes the algorithm does not converge, and then you need to decide what to do about the non-convergence, and to make informed decisions you have to know the basics.

Let's take a look at what problems the data augmentation algorithm solves. This is the example data set that I use when teaching regression analysis: occupations from the censuses of Canada. We have a few variables, education, income, women, and prestige, and we can see that we are missing data on the prestige variable. How would we impute the values of prestige? The simplest approach would be to run a regression analysis of prestige on the other variables, using the cases that have prestige, and then use that regression model to predict the values for the cases that don't. Simple enough, but unfortunately most real-world data sets are not this simple. In reality the missingness is rarely confined to one variable; we may instead have a pattern like this, with missingness all over the place and more than one variable with missing values.

This situation raises two questions. First, if we don't have data on the predictors of prestige, we simply cannot run a regression analysis. How do we estimate an imputation model if we don't have complete data to start with? Second, how do we calculate the predictions for a case when we don't have data on, for example, education? These are the kinds of questions that a data imputation algorithm must answer.

There is also another issue in how we do imputation. If we take a look at the stochastic regression imputation example from Enders, what we do there is first run the regression model on all the available data and then use that model to calculate predictions. This works well when you have missing data on just one of the variables, because then you can easily predict that variable using all the others. Now that's simple, but there is a small problem: the standard errors from any analysis of data imputed this way would be biased. The reason is that the imputation process assumes that this regression line is the population regression line. It is not; it is an estimate of the regression line, and we use that estimate to do the imputations. So when we calculate standard errors for analyses based on the imputed data sets, we should somehow take the uncertainty, or imprecision, of the estimated imputation model into account. In multiple imputation we do exactly that, by using a different imputation model, or a different set of parameters for the imputation model, for each imputed data set. So in fact we don't use the same regression line for every imputed data set; if we have 10 imputed data sets, we use 10 different regression lines.
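To make this concrete, here is a minimal Python sketch of stochastic regression imputation that also draws a different regression line for each imputed data set. The simulated education, income, and prestige variables are hypothetical stand-ins for the Prestige example; this illustrates the principle, not the procedure any particular software uses.

```python
import numpy as np

rng = np.random.default_rng(42)

def impute_once(X_obs, y_obs, X_mis, rng):
    """One stochastic regression imputation that also reflects the
    uncertainty in the fitted regression line, so repeated calls use
    different regression lines, one per imputed data set."""
    A = np.column_stack([np.ones(len(X_obs)), X_obs])
    # Fit OLS on the complete cases.
    beta_hat, *_ = np.linalg.lstsq(A, y_obs, rcond=None)
    resid = y_obs - A @ beta_hat
    sigma2 = resid @ resid / (len(y_obs) - A.shape[1])
    # Draw a coefficient vector from its approximate sampling
    # distribution: a different line for every imputed data set.
    beta = rng.multivariate_normal(beta_hat, sigma2 * np.linalg.inv(A.T @ A))
    # Predict the missing values and add residual noise so the imputed
    # values have realistic scatter instead of lying on the line.
    A_mis = np.column_stack([np.ones(len(X_mis)), X_mis])
    return A_mis @ beta + rng.normal(0.0, np.sqrt(sigma2), size=len(X_mis))

# Hypothetical usage with Prestige-style variables: education and
# income observed for everyone, prestige missing for some rows.
n = 100
education = rng.normal(11, 3, n)
income = rng.normal(7000, 2000, n)
prestige = 10 + 2.5 * education + 0.002 * income + rng.normal(0, 5, n)
missing = rng.random(n) < 0.3
X = np.column_stack([education, income])
imputations = [impute_once(X[~missing], prestige[~missing], X[missing], rng)
               for _ in range(10)]  # 10 versions of the missing values
```

The key step is the random draw of beta: without it, every imputed data set would come from the same estimated line, and the standard errors computed from the imputed data would be too small.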
How do we then come up with those 10 different regression lines, or come up with a regression line in the first place when we have missing data in more than one variable? This is what the data augmentation algorithm provides. Data augmentation basically solves a chicken-and-egg problem: to do imputation we need a regression model, and to estimate a regression model we need data, which we don't have before imputing.

In practice this is an iterative algorithm: we run many, many cycles, and each cycle has two steps. First comes the imputation step, the I step. We start with some covariance matrix and mean vector for the data. These are estimates, and they typically come from an application of the EM algorithm for missing data. We use that covariance matrix and mean vector to calculate an imputation model, or a set of imputation models, one for every missing data pattern that we have. Let's say we have three variables, with one pattern where just the first variable is missing and another pattern where the first two variables are missing. In practice we would impute the first variable twice: first for the cases in the pattern where we have data on the second and third variables, and then for the cases in the second pattern, where the first and second variables are missing, using only the third variable. So we use the same covariance matrix to calculate different regressions for imputation, depending on which predictor variables are available. This is the imputation step, the I step.

After we are done with the I step, we move on to the P step, where P refers to posterior. In the P step we have one imputed data set to start with. We calculate a covariance matrix and a mean vector from that data set, and that is our estimate. However, we also need to take the estimation error into account. This goes a bit into Bayesian statistics, but in practice we take a mean vector and a covariance matrix that are close to the ones we have. So we have an estimate of the covariance matrix, and then we take one that is close by, where the closeness is determined by the sample size. If our sample size is small, the covariance matrix that we draw from the posterior distribution can be quite different from the estimate; if the sample size is very large, the posterior covariance matrix will be very similar to the estimated one. This is how we take estimation error into account: we take an estimate of the covariance matrix and use it to randomly draw another covariance matrix that reflects the estimation error. That drawn covariance matrix then feeds into the next I step, where we again calculate the regressions for the imputations.

In practice this produces a chain of imputations: I step, P step, I step, P step, and so on. The imputations from an I step serve as data for the next P step, and the covariance matrix from a P step serves as input for the next I step. As a consequence, the I and P steps are not independent; they depend on the previous I and P steps, so there will be autocorrelation in the imputed data.
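Here is a minimal sketch of one such cycle under a multivariate normal model. The conditional-normal draws in the I step are exactly the per-pattern regressions described above, just done row by row; the P step uses scipy's inverse Wishart for the posterior draw. Real software organizes this differently (for example, with sweep operators applied per missing-data pattern), so treat this as a conceptual illustration under those assumptions.

```python
import numpy as np
from scipy.stats import invwishart

def i_step(data, mask, mu, sigma, rng):
    """I step: fill in each row's missing values by drawing from the
    conditional normal distribution implied by the current mu and
    sigma.  The conditional mean is a regression of the missing
    variables on the observed ones, computed from the covariance
    matrix, so rows sharing a missing-data pattern get the same
    regression."""
    filled = data.copy()
    for i in range(len(data)):
        m = mask[i]                       # True where this row is missing
        if not m.any():
            continue
        o = ~m
        s_oo_inv = np.linalg.inv(sigma[np.ix_(o, o)])
        coef = sigma[np.ix_(m, o)] @ s_oo_inv          # regression slopes
        cond_mean = mu[m] + coef @ (data[i, o] - mu[o])
        cond_cov = sigma[np.ix_(m, m)] - coef @ sigma[np.ix_(o, m)]
        filled[i, m] = rng.multivariate_normal(cond_mean, cond_cov)
    return filled

def p_step(filled, rng):
    """P step: draw a new covariance matrix and mean vector from their
    posterior given the just-imputed data.  With a small sample the
    draw can be far from the sample estimates; with a large sample it
    will be very close to them."""
    n, _ = filled.shape
    xbar = filled.mean(axis=0)
    S = np.cov(filled, rowvar=False)
    sigma = invwishart.rvs(df=n - 1, scale=(n - 1) * S, random_state=rng)
    mu = rng.multivariate_normal(xbar, sigma / n)
    return mu, sigma

# The chain alternates I and P steps.  The starting values would
# normally come from the EM algorithm (hypothetical helper here):
# mu, sigma = em_estimates(data, mask)
# for cycle in range(1000):
#     filled = i_step(data, mask, mu, sigma, rng)
#     mu, sigma = p_step(filled, rng)
```

Because each I step uses the mu and sigma drawn in the previous P step, successive data sets produced by this loop are not independent.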
In practice, what this autocorrelation means is that if you happen to impute a set of high values for one variable, then the mean of that variable will be high in the next mean vector and covariance matrix, and that in turn will cause the next round of imputations to draw high values for that variable. This autocorrelation is unavoidable, and it is also undesirable.

To do away with the autocorrelation, we allow the imputation algorithm to run many, many iterations before we take another data set (steps 5 and 6 on the slide). We might have a chain of 1,000 imputed data sets and store only every 100th of them. Because there are 99 data sets between the data sets that we actually use, we can be pretty sure that any autocorrelation has died out, so that data set number 1 and data set number 101 are independent. That's the idea.

Unfortunately, there are scenarios where this idea does not work, and this relates to convergence. Because the I and P steps depend on each other, there are sometimes long-running dependencies. If there are trends, there is no convergence. For example, in a plot of 2,000 cycles of imputing the job performance mean, we can see trends that go up and down, a dependency over time, which means the chain has not converged. Convergence occurs when all these trends cease to exist.

How can we diagnose convergence? We can use plots. There are, of course, statistical techniques as well, but plotting allows you to understand the issue more easily. We can simply plot the means and covariances of every variable over the iterations and check whether there are patterns in that visual presentation. Besides plotting the actual means of the imputations, we can also calculate the autocorrelation over time and draw an autocorrelation plot, that is, a plot of the autocorrelation as a function of lag. In a converged model, the autocorrelation will be high for the first few lags, perhaps 10 or 20, but then it should go to approximately zero; if there are 100 data sets between two saved data sets, those two data sets should not be correlated. So if the autocorrelation as a function of lag does not go to zero for a parameter, that parameter has not converged. Those are the graphical techniques; there are also diagnostic statistics that I will not go into in this video.
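A lag-autocorrelation check like the one just described is easy to compute by hand. This is a small sketch; the chain of per-cycle means would come from the data augmentation run, and the lags shown are just illustrative.

```python
import numpy as np

def autocorrelation(chain, max_lag=100):
    """Autocorrelation of a parameter chain, for example one variable's
    mean over successive P steps, as a function of lag.  In a converged
    chain this should drop to roughly zero well before the lag at which
    data sets are saved (every 100th cycle in the example above)."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    denom = x @ x
    return np.array([(x[: len(x) - k] @ x[k:]) / denom
                     for k in range(max_lag + 1)])

# Hypothetical usage: 'means' holds the imputed-data mean of one
# variable at each of 2,000 cycles of the chain.
# acf = autocorrelation(means)
# plt.plot(acf)  # should decay to roughly zero within a few dozen lags
```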
How do we deal with non-convergence? There are a couple of things we need to understand about it. Non-convergence often occurs when you have variables that are highly correlated. Highly correlated variables make the estimates for some of the variables in the prediction model inefficient, with inflated variances, and that can cause this kind of situation. To deal with the issue, we have two options. If we have a large number of auxiliary variables in the model, we might consider dropping some of them. The other option is to apply what is called a ridge prior distribution. The idea of a ridge prior again relates to Bayesian statistics, but conceptually it means that instead of having the covariance matrix of the data serve as input for the next I step, we calculate the covariance matrix from the data after adding a few random observations to it. This makes all the covariances a bit smaller, and when the covariances are smaller, the imputation algorithm converges more easily. So those are the two most common techniques, using fewer auxiliary variables or applying a ridge prior; there are others as well. In practice, it is important to check that the imputations have converged. Whether you actually encounter these convergence problems is another matter, but it is important to check, and to report that you did not have convergence problems.
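Conceptually, the effect of the ridge prior can be illustrated as follows. The sketch below is not the actual prior used by imputation software, which implements the ridge idea as a hyperparameter of the posterior distribution; it only demonstrates why adding a few independent random rows shrinks the covariances.

```python
import numpy as np

def ridge_like_covariance(data, k, rng):
    """Illustration of the ridge-prior idea: append k artificial rows
    in which each variable is drawn independently (matching its mean
    and standard deviation but carrying no covariance with the other
    variables), then recompute the covariance matrix.  The off-diagonal
    covariances shrink toward zero, which helps the data augmentation
    algorithm converge."""
    mu = data.mean(axis=0)
    sd = data.std(axis=0, ddof=1)
    fake = rng.normal(mu, sd, size=(k, data.shape[1]))
    return np.cov(np.vstack([data, fake]), rowvar=False)

# Example: two highly correlated variables; a handful of artificial
# rows noticeably shrinks the off-diagonal covariance.
rng = np.random.default_rng(1)
x = rng.normal(size=(50, 1))
data = np.hstack([x, x + rng.normal(scale=0.1, size=(50, 1))])
print(np.cov(data, rowvar=False)[0, 1])            # close to 1
print(ridge_like_covariance(data, 10, rng)[0, 1])  # noticeably smaller
```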