Missing data techniques can be divided into traditional simple techniques and modern techniques. Generally, the modern techniques are preferable, but there are some instances where the simple techniques may be applicable. It is also important to understand why some of these commonly used simple techniques do not work. Let's take a look at what these simple techniques do, and we'll start with list-wise deletion and pair-wise deletion, which are the same thing in this first example.

The data come from Enders, so these examples are from his book. We have IQ scores and job performance scores, and the missing data are in the job performance variable. The pattern of missingness is such that job performance is missing at random: the missingness in job performance depends on IQ, so every job performance value where IQ is below 99 is missing, but the missingness does not depend on job performance itself. This is a scenario where modern missing data techniques work well, but some of these simple techniques do not.

The simplest possible thing that we can do with these data is to apply list-wise or pair-wise deletion. What that means here is that we drop all observations with any missing data and simply estimate the regression coefficient from the complete observations.

So what is the difference between list-wise deletion and pair-wise deletion? Let's take a look at an example. This is a small data set generated with R, with three variables, X1, X2, and X3, and four observations. If we drop all cases with any missingness, we drop the first three observations and only one observation remains, and we cannot calculate any correlations, because you need at least two observations to calculate a correlation. So if we request a correlation matrix that requires complete cases, it gives us no results.

However, it might be tempting to think that there is still something we can do. Maybe we can calculate the correlation between X1 and X2 using the second and fourth observations, because we have data for both X1 and X2 in those two observations. That is pair-wise deletion: for each pair of variables, we use only those cases that have complete data for that particular pair. For the correlation between X2 and X3 we would use the first and fourth observations, and for the correlation between X1 and X3 the third and fourth observations. We use different observations for different correlations. We can calculate the correlation matrix this way, but it is potentially problematic.

To see why, let's look at what the correlations are. We get a correlation matrix where X1 correlates perfectly positively with X2, X1 correlates perfectly negatively with X3, and X2 and X3 correlate perfectly positively. This is an impossible correlation matrix: you cannot have two variables, X2 and X3, that are perfectly correlated with each other while one correlates positively and the other negatively with a third variable. No real data set could produce this kind of matrix. So this is one problem of pair-wise deletion: it can produce correlation matrices that are impossible.
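To make this concrete, here is a small R sketch with a made-up four-row data set (these are illustrative numbers, not the exact ones from the example above) that reproduces the same kind of impossible pair-wise correlation matrix.

```r
d <- data.frame(
  X1 = c(NA, 1, 2, 3),
  X2 = c(1, 2, NA, 4),
  X3 = c(5, NA, 9, 6)
)

# List-wise deletion: only row 4 is complete, so no correlation can be
# computed (expect NA results and a warning)
cor(d, use = "complete.obs")

# Pair-wise deletion: each correlation uses whichever rows happen to be
# complete for that particular pair of variables (here, two rows per pair)
r <- cor(d, use = "pairwise.complete.obs")
r

# The matrix is not positive semi-definite: it has a negative eigenvalue,
# which is the formal way of saying no real data set could have produced it
eigen(r)$values
```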
Then, if we try to use this correlation matrix in a regression analysis, I don't even know what would happen; probably the software would give us some kind of error message saying that this is not a valid correlation matrix.

There are also other problems with pair-wise deletion. Quite often when we do this kind of analysis, we want to report a sample size for the correlation matrix. What would we report if the sample size varies between correlations, as it certainly will when different correlations are calculated from different observations? Also, if we then use that correlation matrix in other analyses, how would we calculate standard errors, given that the sample size is not one number but differs from correlation to correlation? Finally, even if we overcome that problem, if we then run a regression analysis on the same data, the regression software will apply list-wise deletion for us. We would report the regression results and the correlation matrix in a paper, and someone would notice that the correlation matrix and the regression analysis do not match: the correlations do not reproduce the regressions, because the correlations were calculated from a different set of observations than the regression analysis.

So that is pair-wise deletion. It sounds appealing, because you use the maximum amount of data for each correlation, but it is problematic for these statistical reasons, and it produces very confusing results because of the sample size issue and the mismatch between the correlations and the regression analysis. If you must delete observations, list-wise deletion is probably a better alternative than pair-wise deletion.

Let's move on to other techniques. We can delete data, but we can also impute, or substitute, data. The simplest possible way of imputing data, which basically means guessing what the missing values would have been had we observed them, is mean imputation. Instead of leaving these values missing, we make a guess: what is the best guess for a missing job performance value? The mean of the non-missing values. So we take the mean over the people who do have job performance data and simply substitute that mean for each missing value.

What happens now if we run a regression analysis? In the original data, if we regress job performance on IQ, the missingness depends on the X variable, IQ, so a regression analysis would not have any problems: estimation would be consistent and unbiased. But after mean substitution, the imputed observations are systematically higher than the actual values would have been, so the regression analysis becomes biased and inconsistent. In other words, because the missingness depends on the X variable and not the Y variable, the regression would have been fine, and by applying mean imputation we cause a bias ourselves through the procedure that we apply. Also, the variance of job performance will be greatly underestimated, because the imputed values are not really all the same value: the real values vary, but the variance of the imputed values is zero. So this is a really bad idea, but it is very common, because it is easy to apply and some entry-level quantitative analysis books recommend it as the default technique.
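Here is a small R sketch of what mean imputation does in this kind of setting. The population values (the intercept, slope, and error variance) are my own made-up numbers, not Enders' data, but the missingness mechanism is the same idea: performance is missing whenever IQ is low.

```r
set.seed(1)

# Hypothetical data: job performance depends on IQ
n    <- 200
iq   <- rnorm(n, mean = 100, sd = 15)
perf <- 10 + 0.5 * iq + rnorm(n, sd = 8)

# Missing-at-random mechanism: performance is missing whenever IQ is below
# 100, i.e. the missingness depends only on the X variable
perf_obs <- ifelse(iq < 100, NA, perf)

# Mean imputation: replace every missing value with the observed mean
perf_imp <- ifelse(is.na(perf_obs), mean(perf_obs, na.rm = TRUE), perf_obs)

# List-wise deletion (lm drops incomplete rows by default) leaves the slope
# roughly intact, while mean imputation attenuates it and shrinks the variance
coef(lm(perf ~ iq))        # complete data, for reference
coef(lm(perf_obs ~ iq))    # list-wise deletion
coef(lm(perf_imp ~ iq))    # mean imputation
var(perf); var(perf_imp)
```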
So don't mean impute. This is probably the worst thing that you can do when you deal with missing data.

We can do the imputation, the guessing of the missing values, in a smarter way, and the smarter way is regression-based imputation. It works by running a regression analysis on the complete cases that we have and then using the regression model to predict the missing values of job performance. As we can see, that produces job performance values that are not biased in this case; they are roughly where they should be, but there is no dispersion around the regression line. If we now estimate the regression model using these data, the model will look too good, because we do not model the variation around the line at all, although the regression line itself would actually be estimated consistently. So what are the downsides of this technique? While we estimate the conditional mean, the predicted value, correctly, we do not model the dispersion at all. In mean substitution we modeled both the dispersion and the expected value incorrectly; now we model the expected value correctly, but we still ignore the dispersion. Using these data in any subsequent analysis would therefore be highly misleading.

We can still improve on regression-based imputation with what is called stochastic regression imputation. Instead of taking the predicted value as such, we calculate the predicted values and then add random noise to those predictions based on the residual variance of the model. This is actually a pretty good technique. It has shortcomings, but we estimate the regression line correctly, we estimate the R-squared and the variance correctly, and we estimate the mean of job performance correctly. So this starts to be a useful technique.

This technique also has a problem, though. While the actual estimates are correct, if we then use the imputed data in any other analysis, our standard errors will be biased. The reason is that the imputations are not real data; they are our guesses. A regression analysis run on the imputed data does not know that these are uncertain guesses rather than observed values, and therefore does not adjust the standard errors to take that uncertainty into account. So while this technique produces reasonably good estimates in many scenarios, the standard errors will be inconsistent, generally too small. We have ways of dealing with that issue, but let's stay with the simple techniques for now.
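Before moving on, here is a minimal R sketch of deterministic versus stochastic regression imputation, using the same hypothetical IQ and job performance setup as in the sketch above.

```r
set.seed(2)
n    <- 200
iq   <- rnorm(n, 100, 15)
perf <- 10 + 0.5 * iq + rnorm(n, sd = 8)
perf_obs <- ifelse(iq < 100, NA, perf)      # missing at random: depends on IQ

# Fit the imputation model on the complete cases
fit  <- lm(perf_obs ~ iq)
miss <- is.na(perf_obs)
pred <- predict(fit, newdata = data.frame(iq = iq[miss]))

# Deterministic regression imputation: plug in the predictions as-is
perf_reg <- perf_obs
perf_reg[miss] <- pred

# Stochastic regression imputation: add residual noise drawn from the model
perf_sreg <- perf_obs
perf_sreg[miss] <- pred + rnorm(sum(miss), sd = summary(fit)$sigma)

# Deterministic imputation understates the dispersion and overstates the
# correlation; the stochastic version restores both roughly to their true values
var(perf); var(perf_reg); var(perf_sreg)
cor(perf, iq); cor(perf_reg, iq); cor(perf_sreg, iq)
```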
Now let's take a look at a comparison. Enders presents a simulation study: a sample size of 250 and 1,000 replications, so this is a Monte Carlo simulation. He generates data sets and then estimates the key quantities using these different techniques. There are three conditions: missing completely at random, where the missingness is just a random process; missing at random, where the missingness depends on the IQ value, which was the case in the examples above; and missing not at random, where the missingness depends on the job performance values themselves. Half of the job performance values are always missing, and the missingness only occurs in the job performance rating; nothing is missing in IQ. This is the kind of simulation that is very simple to run on your own computer if you don't believe the evidence; I will show a small sketch of such a simulation after we go through the results.

Let's start with missing completely at random, which is the easiest case. A couple of quantities are biased; they are bolded in the book, and I highlight them in yellow to make them more visible. The job performance variance, and its covariance and correlation with IQ, are all biased if we apply mean imputation. We substitute the mean, but the mean is not the right prediction for the missing values, and therefore essentially everything we would be interested in about job performance becomes biased and inconsistent. Interestingly, with list-wise deletion there are really no problems here: what we lose is precision, but there is no bias and no inconsistency. With stochastic regression imputation there is no bias at all either, although, again, if we then run a regression analysis on the imputed data, the standard errors will be inconsistent.

If the data are missing at random, we get more bias. Under missing at random, list-wise deletion is problematic for most quantities, arithmetic mean imputation is also problematic, and regression-based imputation is problematic, but stochastic regression imputation works pretty well. The difference between stochastic regression imputation and plain regression-based imputation is how the variance is handled: stochastic regression imputation increases the variance of the predicted values by adding random noise, and that variance also feeds into the correlation, which makes it more correct than plain regression-based imputation.

Then the final case, missing not at random, where the missingness depends on the job performance values that are actually missing. Here things go haywire; most of these techniques do not work. What we still estimate correctly are the IQ mean and the IQ variance, because we observe all the IQ values. The exception is list-wise deletion: there we throw away the IQ values of cases with missing job performance, even though we could have used them for calculating the mean, and that is why even the IQ mean is biased. So list-wise deletion can produce bias in the means of variables for which we actually observe all values. But generally, if you look at the pattern, missing not at random is a scenario that is problematic for all of these techniques.

Overall, looking at these results: with missing at random, stochastic regression imputation is okay, as long as you remember that the standard errors will be biased. With missing completely at random, stochastic regression imputation and simply deleting the incomplete cases are both okay. Mean substitution always biases the results; it is the worst possible technique. And regression-based imputation is not much better: it estimates the covariance correctly, but the correlation, which we are typically more interested in than the covariance, is estimated incorrectly.
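Here is the kind of Monte Carlo sketch I mentioned above. The population values are my own made-up numbers, not Enders' exact setup, but it follows the same logic: 250 cases, 1,000 replications, roughly half of the job performance values missing, and a comparison of list-wise deletion and mean imputation estimates of the IQ and performance correlation under the three missingness mechanisms.

```r
set.seed(3)
reps <- 1000
n    <- 250

res <- replicate(reps, {
  iq   <- rnorm(n, 100, 15)
  perf <- 10 + 0.5 * iq + rnorm(n, sd = 8)

  # roughly 50% missingness in job performance under three mechanisms
  mcar <- ifelse(runif(n) < 0.5,      NA, perf)   # missing completely at random
  mar  <- ifelse(iq   < median(iq),   NA, perf)   # missing at random (depends on IQ)
  mnar <- ifelse(perf < median(perf), NA, perf)   # missing not at random

  mean_imp <- function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x)

  c(complete      = cor(perf, iq),
    listwise_mcar = cor(mcar, iq, use = "complete.obs"),
    listwise_mar  = cor(mar,  iq, use = "complete.obs"),
    listwise_mnar = cor(mnar, iq, use = "complete.obs"),
    meanimp_mar   = cor(mean_imp(mar), iq))
})

# average estimate over the replications; compare each row to the complete-data row
rowMeans(res)
```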
There are other traditional techniques, but they are not very commonly used, so I will not talk about them in detail. General techniques include hot-deck imputation and similar response pattern imputation. These are based on the idea that you find cases that are similar to the one with missing data, take values from those similar cases, and impute those values in place of the missing data. They have the same kinds of problems as regression-based imputation.

Then there are scale-item level techniques, such as averaging the available items. If you have a scale with three items and one item is missing, you can score the scale as the average of the first and second items. This is sometimes a useful technique; it has some limitations, and it probably gets applied quite often without being explicitly reported. Taking the mean of the available items is equivalent to imputing each missing item with that person's mean of their available items.

Then we have techniques for time series data. One is last observation carried forward. Basically, if we have, say, revenue data up to 2008 but need data for 2009 and 2010, we take the revenue value from 2008 and use it for 2009 and 2010 for that case. This will of course distort any trends in the data, but it is sometimes defensible if the amount of missing data is very small. Time series interpolation refers to a scenario where we have data for 2008 and 2010 but 2009 is missing: we simply take the average of the 2008 and 2010 values and use it in place of the missing 2009 value. This is basically another variant of regression-based imputation, and it has the same issues.
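Here is a tiny R sketch of these scale-item and time-series techniques; the item scores and revenue figures are hypothetical.

```r
# Hypothetical three-item scale; this respondent skipped item 3
items <- c(item1 = 4, item2 = 5, item3 = NA)

# Averaging available items: score the scale from what was answered
mean(items, na.rm = TRUE)

# Hypothetical yearly revenue with 2009 and 2010 missing
revenue <- c("2006" = 110, "2007" = 118, "2008" = 125, "2009" = NA, "2010" = NA)

# Last observation carried forward: repeat the last observed value
locf <- revenue
for (i in 2:length(locf)) if (is.na(locf[i])) locf[i] <- locf[i - 1]
locf

# Time series interpolation: if 2009 were missing between observed 2008 and
# 2010 values, linear interpolation fills it with the average of its neighbours
approx(x = c(2008, 2010), y = c(125, 131), xout = 2009)$y   # returns 128
```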
So, a summary. How do these techniques work? If the data are missing completely at random, list-wise deletion is okay if the sample size permits it. There is a loss of efficiency, but if you have a large sample size you will still be efficient enough, and this is a simple thing to do. Of course, applying maximum likelihood estimation for the missing data would be simple as well and statistically more appealing, so if that is an option it should be preferred over list-wise deletion; but list-wise deletion is not entirely bad. If the data are missing at random, things get more complicated. List-wise deletion might be okay if the amount of missing data is trivially small, or if the missingness is in the X variables but not the Y variable; in other cases it is problematic. Stochastic regression imputation would be okay, but then the standard errors will be incorrect and you should do something about that. If the data are missing not at random, none of these techniques works well. The options are then more advanced selection models, but they too make assumptions that are sometimes difficult to justify. So your best option might be somewhere between a selection model and techniques that are designed for missing at random data, while acknowledging that your results might be biased.

There are a couple of final points that I want to make. Mean imputation, regression imputation, and pair-wise deletion should be avoided; I hope this video makes clear why that is the case. Averaging available items may work in some scenarios: if the amount of missing data is small and the items are closely comparable, it can be a useful technique. We should always consider the trade-off between simplicity and precision, because a simple technique like averaging available items is less likely to be misused than a complex technique. Finally, all of these techniques are generally inferior to the modern missing data techniques, which are multiple imputation and maximum likelihood estimation for missing data. Multiple imputation, which I will talk about more in another video, is basically an extension of stochastic regression imputation where you repeat the imputation process many times over. By doing so it models the uncertainty, the variation related to the imputation process itself, and that corrects the standard errors and takes care of the final problem of stochastic regression imputation.

The main takeaway of this video is that all of these simple techniques have problems, and some of them, particularly mean imputation, should probably never be used, even though they commonly are.
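As a preview of that multiple imputation idea, here is a deliberately simplified R sketch using the same hypothetical IQ and job performance setup as in the earlier sketches: it repeats a stochastic regression imputation m times and pools the results with Rubin's rules. A proper implementation, for example the mice package, would also reflect uncertainty in the imputation model's own parameters, which this sketch ignores.

```r
set.seed(4)
n    <- 200
iq   <- rnorm(n, 100, 15)
perf <- 10 + 0.5 * iq + rnorm(n, sd = 8)
perf_obs <- ifelse(iq < 100, NA, perf)     # missing at random, as before
miss <- is.na(perf_obs)

m   <- 20                                  # number of imputed data sets
est <- se <- numeric(m)
fit_imp <- lm(perf_obs ~ iq)               # imputation model on observed cases

for (j in 1:m) {
  # one stochastic regression imputation per data set
  perf_j <- perf_obs
  perf_j[miss] <- predict(fit_imp, data.frame(iq = iq[miss])) +
    rnorm(sum(miss), sd = summary(fit_imp)$sigma)

  fit_j  <- lm(perf_j ~ iq)                # analysis model
  est[j] <- coef(fit_j)["iq"]
  se[j]  <- coef(summary(fit_j))["iq", "Std. Error"]
}

# Rubin's rules: pooled estimate, and a standard error that also reflects the
# between-imputation variation (this is what fixes the standard error problem)
pooled    <- mean(est)
within    <- mean(se^2)
between   <- var(est)
pooled_se <- sqrt(within + (1 + 1/m) * between)
c(estimate = pooled, se = pooled_se)
```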