Multiple imputation is a modern missing data technique. This technique and maximum likelihood estimation for missing data are the two techniques that really should be applied in any missing data scenario. In this video I'll talk about the principle and basic idea of multiple imputation, and I'll go into some technical details in another video. These technical details are important to understand if you apply multiple imputation, but to get started you just need to understand the idea.

Multiple imputation is a modern technique, but sometimes researchers or reviewers object to imputation on the grounds that imputation basically means coming up with artificial data, and that we should not use artificial data but only real data, without making any guesses about data that we don't have. This kind of criticism is unfounded for a couple of reasons. First, and most importantly, the purpose of missing data imputation is not to generate valid estimates of specific case values. Rather, it is a tool for estimating relationships between variables. So whether the estimates for individual case values are correct or not is basically irrelevant, as long as we get the correlations and other statistical associations right. Second, this technique, while it may sound like cheating, has actually been shown to be consistent and to produce better results than simply using whatever data you have and deleting listwise those cases that don't have full data.

When I took my first course in structural equation modeling, the instructor, Todd Little, talked about multiple imputation. He said that he had encountered the opinion that multiple imputation, or missing data analysis in general, is unethical because you are coming up with data that you don't have. He said he does not agree with that view; he thinks it is unethical not to do so, because when you do listwise deletion, which is basically the other option you would often use, you are throwing away data that you do have. It is always better to use the data, even if incomplete, than to throw data away because some of it is missing. So this is just to set the stage and to acknowledge that these techniques are sometimes objected to, but the objections are really based on misunderstandings of what multiple imputation is about.

Let's now go on to the multiple imputation technique itself. In my previous video about simple and traditional missing data techniques, I showed that the best among the simple techniques is stochastic regression imputation. In stochastic regression imputation, we fit a regression line to the data that we have, we calculate predictions for the data that we don't have, and then we add random noise based on the fitted regression model. The problem was that this does not take the uncertainty, or estimation error, in these predictions into account. The imputed values are not the actual individual values that we would like to have; they are draws from a distribution. We would like to model the fact that the data are distributed around the predicted line, but for practical reasons we need to estimate individual points. So how do we solve this problem? Multiple imputation basically solves it by doing stochastic regression imputation many, many times. If we generate 100 data sets with different imputations, then the imputed points would be roughly normally distributed around the regression line, and that would resemble the population from which we would draw the data.
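To make the idea concrete, here is a minimal sketch in Python of repeating stochastic regression imputation many times; the two-variable setup, the function names, and the use of NumPy are my own assumptions, not something from the video.

```python
import numpy as np

def stochastic_regression_impute(x, y, rng):
    """Impute missing y values from a regression of y on x,
    adding residual noise drawn from the fitted model."""
    observed = ~np.isnan(y)
    # Fit y = b0 + b1*x on the complete cases
    b1, b0 = np.polyfit(x[observed], y[observed], 1)
    residual_sd = np.std(y[observed] - (b0 + b1 * x[observed]), ddof=2)
    y_imputed = y.copy()
    missing = ~observed
    # Prediction plus random noise, so imputed values scatter around the line
    y_imputed[missing] = (b0 + b1 * x[missing]
                          + rng.normal(0, residual_sd, missing.sum()))
    return y_imputed

def multiple_impute(x, y, m=100, seed=1):
    """Return m data sets, each with a different stochastic imputation."""
    rng = np.random.default_rng(seed)
    return [stochastic_regression_impute(x, y, rng) for _ in range(m)]
```

Note that this simplified sketch reuses the same fitted line for every data set; actual multiple imputation software also adds uncertainty about the regression parameters themselves, which is one of the technical details left for the other video.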
So the first stage is an imputation stage: we impute multiple different data sets, and every data set is different from the others. Then we analyze every data set separately using our analysis technique, and this analysis stage does not differ in any way from the analysis of full data. So we would apply regression analysis like we normally do, we would just apply it to 100 different data sets, or however many we impute. After we have these, let's say, 100 analysis results, we have a summary stage, called the pooling stage, which produces the final results. If we do regression analysis, the final pooled regression coefficient is the mean of the coefficients from these repeated analyses, and the standard error is calculated using a pooling formula (a sketch of this pooling rule follows at the end of this passage). You don't need to memorize the formula, because your software will take care of it for you, but it's useful to know that it exists in case you need, for some reason, to do the pooling phase yourself manually, for example if you apply an imputation technique that your statistical software does not support.

So this is a very simple idea. You impute multiple times, you analyze the imputed data sets, you take the mean of those estimates, and that's your best guess of the final estimate. Then you calculate the standard error based on the variance of the estimates within and between imputations. The within-imputation part is the normal standard error; the between-imputation part is how much variation there is because of the imputation process. That gives you more accurate standard errors than simply using one of the imputed data sets and the standard errors from that.

A couple of points that we need to understand: there is a lot of complexity in the actual imputation stage, but there are a few principles that are general to all possible ways of imputing the data. One is how large the number of imputations should be. The recommended number of imputations has been going up with computing power, because we can now easily run 20, 100, or even 1000 imputations without much problem. Most texts recommend between 20 and 100 imputations. More is always better, but there seems to be a consensus that after about 20 imputed data sets the gains start to be very small. So the recommendations are between 20 and 100 imputations. You can always try different values and see how it works, and if you can do 100 imputations in a minute, there is really no reason to do fewer, because the only reason to do fewer is to save time.

A couple of final points that you need to know generally about multiple imputation before we go into the specifics. This is a simulation-based procedure, and the purpose is not to recreate values for any particular case, but to model the data and the missingness in a way that allows you to estimate the relationships between the variables consistently. Multiple imputation works if your imputation process, the imputation model (I'll talk about that concept a bit more in another video), takes into account all the features that you have in the data. So if you have clustered data and you want to use an analysis that is appropriate for clustered data, for example a multilevel model, then that clustering needs to be taken into account in the imputation process. In that case the imputation model would not be the simple stochastic regression model; it would be something else.
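The pooling formula referred to above is the standard combining rule (often called Rubin's rules): the pooled estimate is the mean of the per-imputation estimates, and the pooled variance adds the average within-imputation variance to the between-imputation variance. A minimal sketch, assuming we already have the coefficient and its standard error from each imputed-data analysis:

```python
import numpy as np

def pool_estimates(estimates, standard_errors):
    """Pool one coefficient across m imputed data sets (Rubin's rules)."""
    estimates = np.asarray(estimates, dtype=float)
    standard_errors = np.asarray(standard_errors, dtype=float)
    m = len(estimates)

    pooled_estimate = estimates.mean()                  # mean of the m estimates
    within_var = np.mean(standard_errors ** 2)          # average sampling variance
    between_var = estimates.var(ddof=1)                 # variance due to imputation
    total_var = within_var + (1 + 1 / m) * between_var  # Rubin's total variance
    pooled_se = np.sqrt(total_var)

    return pooled_estimate, pooled_se
```

In practice, packages such as mice in R do this pooling for you; a sketch like this is only needed if your software does not support the imputation technique you are using.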
A small number of imputations, 5 to 20, is sometimes okay, but if you have the computing resources, which most people nowadays have, then go to 100 or more.

The final point about the imputations is that the imputed results are generally valid only within their own imputation. What that means is that if you get, for example, a likelihood statistic from each imputation, those statistics should not be used in a likelihood ratio test outside the pooling process. The reason is that statistics that quantify variation or uncertainty are not valid unless they are calculated using the pooling procedure; the individual statistics do not take the uncertainty and imprecision due to the imputation process into account. So you really only have the results from the pooling stage, and post-estimation tests on the individual imputed data sets are something you generally shouldn't do. If you need to do model testing, then you need to build that testing into your pooling process, but that might require some programming and might be difficult to do.

So this is a useful technique, and the idea is basically to do stochastic regression based imputation, run it many, many times, and aggregate the results: take the mean of the estimates as your estimate, and take a standard error that combines the uncertainty due to imputation with the estimated standard errors from the individual imputations; that's your final standard error. The imputation process itself can be a bit complicated and there are some technical issues, because you need to ensure that the imputed data sets are independent and that they capture all the relevant information that you need for your analysis. But if your imputation model is correct and your analysis model is correct, then multiple imputation gives you consistent and, in large samples, approximately unbiased results under the missing at random assumption.
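To tie the three stages together, here is a hypothetical end-to-end run that uses the two sketch functions defined above on simulated data; the data-generating choices (sample size, effect size, 30% missingness) are arbitrary illustration, not anything from the video.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated example: y depends on x, and some y values are set to missing
n = 200
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)
y[rng.random(n) < 0.3] = np.nan

# 1. Imputation stage: 100 stochastically imputed data sets
imputed_ys = multiple_impute(x, y, m=100)

# 2. Analysis stage: the same regression of y on x in every data set
slopes, ses = [], []
for y_imp in imputed_ys:
    slope, intercept = np.polyfit(x, y_imp, 1)
    residuals = y_imp - (intercept + slope * x)
    # Standard error of the slope from the usual OLS formula
    se = np.sqrt(np.sum(residuals ** 2) / (n - 2) / np.sum((x - x.mean()) ** 2))
    slopes.append(slope)
    ses.append(se)

# 3. Pooling stage: mean estimate and Rubin's-rules standard error
estimate, se = pool_estimates(slopes, ses)
print(f"pooled slope = {estimate:.3f}, pooled SE = {se:.3f}")
```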