The missing data mechanism refers to the process that generates the missing data: it answers the question of why the missing data are missing. When we deal with missing data, there are two things to consider. There is the pattern, which tells us what data are missing, and there is the mechanism, which tells us why the data are missing. In this video I'll be looking at the mechanisms. There are certain scenarios where the pattern dictates that we don't need to worry about missing data as much as in other cases, but more importantly, the mechanism determines what kind of tools we need to apply. In most cases the missing data mechanism is a lot more important than the pattern. It is also something that we cannot fully test: the pattern is easy to see from the data, but the mechanism requires some theoretical understanding.

What are the missing data mechanisms? There are three labels for missing data mechanisms, and these come from Rubin's work. The labels are a bit unfortunate because they are not very descriptive; you really need to think them through to understand what the mechanisms are. Missing completely at random (MCAR) refers to a case where the missingness does not depend on anything in the data. Missingness is a purely random process, and in this case the only consequence is that there is less data for us to analyze, because the cases that are missing are basically a random sample of our full data set. Missing at random (MAR) refers to missingness that can depend on some other variables in the data, but not on the value that is missing. Missing not at random (MNAR) is the worst case: here the missingness depends on the actual missing value. For example, how much your salary is can determine whether you go to work or not, and if you don't work, your salary is not observed. This is a classic case of missing not at random. To understand the difference between missing at random and missing not at random, let's take a look at how Enders explains this.
He has this example of job performance measures. We have IQ measures, job performance has missing data, and here is a scatterplot of the data. We can see that every person with an IQ of less than 99 has missing data for job performance, whereas the others have job performance data. Is this missing completely at random, missing not at random, or missing at random? To answer that, we need to understand the mechanisms.

Enders explains the mechanisms using this figure. Let's start with missing completely at random because that's the easiest to understand. Here R denotes missingness, and the missingness depends on something we call Z. The Z variables are unobserved, and they are not correlated with the IQ or job performance data that we have. In missing completely at random, the cause of missingness does not depend on any of the variables that we observe, so from the perspective of our sample, missingness is a completely random phenomenon. This is the easiest case to deal with.

Then we have missing at random. In this case, the missingness of job performance could depend on IQ but not on the actual job performance values. For example, the company could have decided to only hire the people with high IQ and not the others, so the missingness depends on the IQ but not on the actual job performance. The worst case would be that the missingness depends on IQ but, more importantly, on the missing job performance values themselves. For example, if we fire the people who are performing poorly, then, had we measured those people, their job performance ratings would have been lower, but we are not measuring them precisely because they are performing poorly in their job. This is the worst case, missing not at random.

Missing completely at random is not problematic. Missing at random can be dealt with. Missing not at random is what causes problems: we have some ways of dealing with that issue, but our options are fairly limited.
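The three mechanisms can be illustrated with a small simulation. This is only a sketch, not the data from the video: the variable names, cutoffs, and missingness fractions are my own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Two correlated variables: x is always observed, y may go missing.
x = rng.normal(100, 15, n)
y = 0.5 * x + rng.normal(0, 10, n)

# MCAR: missingness is a coin flip, unrelated to anything in the data.
mcar_mask = rng.random(n) < 0.3

# MAR: missingness in y depends only on the observed x.
mar_mask = x < np.quantile(x, 0.3)

# MNAR: missingness in y depends on the (unobserved) value of y itself.
mnar_mask = y < np.quantile(y, 0.3)

y_mcar = np.where(mcar_mask, np.nan, y)
y_mar = np.where(mar_mask, np.nan, y)
y_mnar = np.where(mnar_mask, np.nan, y)

# Under MCAR the observed cases are a random sample, so their mean stays
# close to the full-data mean; under MAR and MNAR it is pulled upward
# because low-y cases are preferentially removed.
print(np.nanmean(y), np.nanmean(y_mcar), np.nanmean(y_mar), np.nanmean(y_mnar))
```

The point of the sketch is that the three data sets can look superficially similar, yet only the MCAR one leaves the observed cases representative of the full sample.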
So is this missing completely at random, missing at random, or missing not at random? We can first rule out missing completely at random, because we can see that there is a pattern: the IQ score determines whether job performance is missing. For everyone with an IQ of less than 99, job performance is missing, so there is a clear statistical association. This is therefore either missing at random or missing not at random. Which one is it? We need to think about whether the unobserved job performance determines the missingness. In an empirical data set we wouldn't be able to check that, because we don't have those observations, but in this simulated data set we actually have both the complete data and the data set with the missingness. We can see that there is no additional pattern: yes, job performance is lower where it is missing, but this is probably just explained by the IQ differences. If we look at the relationship between job performance and IQ, we can see that job performance goes up as IQ goes up. So we can probably conclude that the missing job performance values don't add any more information to the missing data process. The missingness does not depend on these low values; rather, the low values depend on the IQ, and the IQ causes the missingness, so this would be missing at random.

Understanding these concepts and the difference between missing at random and missing completely at random is pretty fundamental for understanding how we deal with missing data.

Let's take a look at the consequences of missing data. I have another video on these, so this is just a brief overview. If we have missing data, less information means less precision, and this applies to all missing data mechanisms: regardless of the mechanism, less data means less efficiency.
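The reasoning above, which only works because we have the complete simulated data, can be sketched as follows. The idea is to residualize job performance on IQ and check whether the residuals differ between the missing and observed groups; under MAR they should not, because once IQ is accounted for, the values carry no extra information about the missingness. The numbers here are my own simulation, not the data in the video.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 5_000
iq = rng.normal(100, 15, n)
perf = 0.4 * iq + rng.normal(0, 6, n)  # complete data, known only in a simulation
missing = iq < 99                      # MAR by construction: depends on IQ only

# Residualize job performance on IQ with a simple linear fit.
slope, intercept = np.polyfit(iq, perf, 1)
resid = perf - (slope * iq + intercept)

# Do the residuals differ between the missing and observed groups?
# Under MAR, no systematic difference should remain.
t, p = stats.ttest_ind(resid[missing], resid[~missing])
print(resid[missing].mean() - resid[~missing].mean(), p)
```

The raw job performance values are clearly lower in the missing group, but the residual comparison shows that this is carried entirely by IQ, which is exactly the MAR verdict reached above.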
Under missing completely at random, the missingness is a random process, so there is no bias, only a loss of efficiency. If there is a systematic pattern, in that the missingness depends on other variables that are observed, then unless we apply modern missing data techniques we will also have biased and inconsistent estimates. So if we simply drop the observations with missing data, because the data are missing at random, and run a regression analysis, we could get problematic results. This depends a bit on the pattern, but generally missing at random can cause bias that modern missing data techniques can compensate for.

Missing not at random is problematic: there is bias. We can get some control with selection models, but these models make assumptions that are empirically untestable, and the techniques are also sensitive to violations of their assumptions, so if the data are missing not at random, our options are fairly limited. People use selection models quite a lot, but their usefulness is a lot more limited than researchers normally acknowledge.

How do we test for missing data mechanisms? Missing not at random cannot be easily tested, because missing not at random means that the missingness depends on the missing value, and if we don't observe the missing value, we can't say whether the missing value causes the missingness. Missing at random versus missing completely at random, however, can be tested. At best we can conclude that the data are either missing at random or missing not at random, or that they are either missing completely at random or missing not at random; in other words, we can differentiate between MAR and MCAR. The simple approach is to do a t-test: we split the data into two based on the missingness. For example, here we would split on job performance: we have the cases with missing job performance, we have those with non-missing job performance, and then we compare the means of IQ.
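The remark that the consequences of MAR "depend a bit on the pattern" can be made concrete with a sketch. In my own simulated version of the IQ example, listwise deletion under MAR biases a simple summary like the mean, yet a regression of performance on IQ using only complete cases still recovers the slope, because the missingness depends only on the predictor that is in the model. The numbers below are illustrative assumptions, not from the video.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000
iq = rng.normal(100, 15, n)
perf = 0.4 * iq + rng.normal(0, 6, n)
observed = iq >= 99           # MAR: low-IQ cases lose their performance score

# Listwise deletion biases simple summaries: the observed mean is too high.
full_mean = perf.mean()
cc_mean = perf[observed].mean()

# But a complete-case regression of perf on iq recovers the slope (0.4 here),
# because the missingness depends only on the observed predictor iq.
slope_cc, _ = np.polyfit(iq[observed], perf[observed], 1)
print(full_mean, cc_mean, slope_cc)
```

So under MAR, whether listwise deletion hurts depends on what you estimate, which is why the video says the consequences depend a bit on the pattern.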
So if with the t-test we conclude that the cases that have missing data are systematically different, that their IQs are lower than those of the cases with full data, then the data cannot be missing completely at random: it must be missing at random or missing not at random. If we have many patterns of missing data and many variables, this requires a large number of tests, so you might want to apply a multiple comparisons correction such as the Bonferroni correction. Then there is another test, Little's MCAR test, which conceptually does all the t-tests at the same time and gives you a single test statistic. I will not go into the details of that test, because understanding them is not very important, but it gives you one p-value that tests MCAR against MAR, and it is available in most commonly used statistical software.
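The t-test check described above can be sketched like this, using scipy's `ttest_ind` on my own simulated version of the IQ example:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 2_000
iq = rng.normal(100, 15, n)
perf = 0.4 * iq + rng.normal(0, 6, n)
perf[iq < 99] = np.nan        # MAR missingness, as in the example

missing = np.isnan(perf)

# Split on missingness of job performance, compare mean IQ across the groups.
t, p = stats.ttest_ind(iq[missing], iq[~missing])
print(iq[missing].mean(), iq[~missing].mean(), p)
# A significant difference rules out MCAR. With k such tests, compare each
# p-value against alpha / k (Bonferroni) rather than alpha.
```

For the combined version, Little's MCAR test is available in common software; in R, for example, the naniar package provides it as mcar_test.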