Hi, I'm Dr. Justin Essary and this is week 10 of Poli Sci 506, Bayesian non-parametric and computational statistics, and this week we're going to talk about the problem of missing data. It's a very common problem that, if you've done some research, you've probably already encountered in your applied data sets. And so this is a topic that's almost surely going to come up in your day to day life as a statistician or researcher. Fortunately, and unlike past years, we now have much more sophisticated ways of dealing with missing data problems than we did before, and these statistical methodologies can help alleviate some of the problems that exist when your data is partially missing. But before we get to those solutions I just want to start off talking about what missing data is and why we would be concerned about it. So as I said before, it's extremely common for observations to be partly or completely missing in data sets that we care about, and this is particularly true of observational data sets, in other words data that aren't coming out of an experiment but rather are coming out of the field. So for example most international relations data sets, IR data sets, don't come from a laboratory; they come from the world, they come from statistics agencies in various countries, they come from human rights groups or other NGOs that are collecting information about countries of interest, and as a consequence of this sort of real world nature of the data it's very common for there to be problems with that data set. So here's a little fake data set that I made up, with the observations in the columns here and the variables in the rows, and quite often you'll find, for example, that some observations have values that are present for some variables but not others. So for observation 2 I know y and I know z but I don't know x. Maybe I know y, probably the dependent variable since y is usually shorthand for that, but I don't know some of the predictor variables, the independent variables. Maybe I don't know the DV but I do know the independent variables. And then finally maybe I don't know anything; I'm missing all the values for the case. All of these patterns are patterns that you'll see in real data. And the question becomes, what do I do about that? And why do I even have to worry about it? Well, it turns out that, first of all, as you probably already know, if data are partially missing you can't estimate a model on them without doing something about it. So for example, if I was trying to run a regression of y as a linear function of x and z on this data set, I would have one observation, because I only have one complete observation in the data set. I can't fit coefficients to x and z for the cases for which I don't have both x and z observed, or for which I don't have y observed. So in other words I would have to just get rid of all four of these observations here, 2, 3, 4 and 5, if I was going to try to estimate a model on them. That's kind of discouraging, throwing out data, for a lot of reasons. It's certainly going to make your n smaller and it's going to make your standard errors bigger. And conceptually, data that are partially missing often still contain some information. So for example, observation 2: sure, I don't know what x is, but I do know what y is and I do know what z is, and so there's still information that maybe I could extract out of that observation that's kind of important. You can see what this default deletion behavior looks like in R in the little sketch below.
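To make that concrete, here is a minimal sketch of R's default behavior with a data set like this one; the data frame and the values in it are made up for illustration and aren't the ones on the slide.

```r
# A tiny made-up data frame shaped like the one on the slide: only observation 1 is complete
dat <- data.frame(y = c(4, 7, NA, 2, NA),
                  x = c(1, NA, 3, NA, NA),
                  z = c(5, 2, 8, NA, NA))

# lm() silently performs listwise deletion, so only the complete cases are used
fit <- lm(y ~ x + z, data = dat)
nobs(fit)  # 1 -- four of the five observations were thrown away
```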
And if I just try to estimate a model on something with partially missing data, I'm going to lose that information because, as we said, I can't estimate a model on data that's partially missing. Simply removing those missing cases, as we often do, is at the very least going to make n smaller and standard errors bigger. That's the best case scenario. In the worst case scenario it can actually cause bias in the estimates: if, for example, the data are systematically missing in some way, the missingness can create a selection bias problem. So we don't necessarily just want to ignore the missingness; we want to do something about it. We don't want to throw out this data set, we want to fix it, we want to make it better. So if we have no choice but to use a data set that has missing values in it, how do we do that in a way that minimizes the threats to the inferences that we draw from those data sets? Now, one way of dealing with missing data is: don't collect missing data, make sure you get it all. That's fine, particularly in an experimental scenario, and that's definitely a goal to strive for. But if we are forced to use data sets with missingness, which we very often are, what do we do in that case? Well, that's the subject of today's lecture, so let's find out. So one important fact about missing data is that not all missing data is created equal. There are different kinds of missing data that have different consequences for inference, and there are three basic patterns of missingness that most people talk about and that I want to focus on today. The first is data that is missing completely at random, or MCAR. Data that's missing completely at random is, in short, data where the occurrence of the missingness, which is to say the fact that the data is missing or not, is, as you would think, completely random. It's not related, first, to the value of the variable that's missing; secondly, to the value of other variables in the data set; and thirdly, it's not related to the pattern of missingness in other variables. All three of those things are equally necessary and equally important. It's important, in short, that the missingness not be related to the data: there can't be any kind of selection effect where big or small values tend to be omitted. It also can't be the case that, for example, when some other variable is high we're more likely to not observe another variable. And finally it can't be the case that when one variable is present or absent, other variables are also likely to be absent. As you might guess, this pattern of missingness is, let's say, optimistic to say the least, because all three of these assumptions are a little bit ridiculous in the sense that they're probably not true in cases where we actually see missing data. It's not the case, for example, that data are just dropped at random from a data set. They're probably missing for a reason. One very trivial idea here is this pattern of missingness in other variables. If data, for example, are missing because there were no resources to collect them, or if a regime is trying to hide some of its statistics or something, it's very likely that they're just not going to report any of them. It's not the case that some things will be there and some things will be gone. It'll be the case that they'll all be gone, or at least more of them will be gone.
So if this has something to do with resource availability or mistakes, it's very likely that data that are missing are going to be missing together, and that's a very common missing data generation procedure. This other idea, that missingness is not related to the value of the absent variable or other variables in the data set, is also kind of unlikely, because if missingness is a process, which is to say if missingness is deliberate or at least systematic for some reason, then it's not likely that the system would produce a random output. Just as a trivial example again, we would probably expect authoritarian or maybe communist regimes to mess with their reported statistics a little bit to make the regime look better, or if making it look better is not possible, simply to not report those statistics at all so that the regime doesn't look bad. We might, for example, see human rights data that are missing from authoritarian regimes because these regimes abuse human rights sometimes and they don't want that fact to get out. They don't want it to end up in news reports or in data sets. So MCAR is, I would say, a dream more than an actual data generation process. But MCAR data is helpful to think about because we may be able to translate other more realistic forms of missingness into data that look like MCAR. And what I mean by that is, consider this example of missing at random data. Data that are missing at random are systematic in missingness, but that system can be modeled. We could say this is something like modelable missingness. So for example, to go back to the previous example, sometimes authoritarian regimes may wanna hide their human rights records or economic performance or something. We might expect the missingness to be related, first, to the value of the missing variable: human rights abuses are gonna get hidden, not good records. And secondly, we may expect it to be related to the type of regime, democracy or autocracy, or maybe the degree of authoritarianism, or maybe the degree of press openness. Who knows? There are all kinds of things that we might expect to be related to the missingness. That's certainly not MCAR; it's a systematic process, right? It's not random. But we can model the missingness. We can say, well, the missing values are random once we've controlled for the effect of regime type and human rights violation level and all these other things. It's modelable missingness. And missing at random data is much more realistic, much more hopeful. And we make it look like MCAR data by modeling out the systematic process. We effectively say there's some signal and some noise in the missingness, and we're going to model out the signal so that all that's left is noise. So it's contingent or conditional, conditionally MCAR. I've never heard that said before, but that's kind of a nice little turn of phrase: conditionally MCAR is what we're shooting for with missing at random data. And then finally there's missing not at random. This is kind of a long name for sad face data, because what this means is we've got data where the missingness is systematic in ways that we don't know, or in ways that are related to factors that we can't measure. So this is a case where there is systematic missingness and we don't know why, or we could not possibly hope to model why.
And, as you might expect, MNAR, sometimes written NMAR for not missing at random, which actually makes a bit more syntactic sense, but whatever, is the case where we're going to have the biggest problems with missingness. Each form of missingness has different potential consequences. Generally speaking, with MCAR data, if you ignore it or simply drop the missing cases, this is primarily an efficiency issue. What I mean by that is that simply dropping the cases, deleting the ones that have missing values in them, probably is not going to bias your results; in fact, there are proofs that under most conditions it will not bias your results. But it could certainly cause greater variability in what those results are, both in the estimated standard errors and also, in a repeated sampling sense, in the overall spread of the distribution. You're going to get more variable results from sample to sample if your data are MCAR. If your data are missing at random and you don't model it, in other words if you leave the missingness unmodeled and simply use deletion, deleting all the cases that have missing values in them, you could get both bias and efficiency problems. The scope of those problems will vary from case to case, but both bias and efficiency problems are conceivable. And the reason is pretty simple. You've got systematic omission of cases from your data set. This is akin to a sample selection bias problem, as studied by many, many people over the years, but probably most famously Heckman. You've got a sample selection problem where your data are not being observed in ways that are related to the DGP, and that systematic sample selection is going to cause bias in your estimates, with the size and direction of the bias basically depending on what the sample selection process looks like. On the other hand, if we can model the MAR process, we can actually eliminate the problem altogether. And that also applies to MCAR data, where modeling the completely at random missingness is actually really easy: if it's truly missing completely at random, you can just model it as a totally random process. We have ways of fixing these problems, or at least reducing them. So we might have less of an efficiency problem and we can eliminate the bias problem. In other words, if data are MCAR or MAR and you don't do anything about it, you're gonna have some difficulties. But we can also implement a model for the missingness, or for the value of the missing variable, or both, that's going to enable us to eliminate the bias problem and reduce the efficiency problem. Now, as you saw from the sad face up here, this doesn't work for everything. So we can fix MCAR and MAR data (that should read MAR data on the slide) to look like non-missing data by filling in values based on what we do know, reducing the bias and efficiency problems. This is good. On the other hand, for MNAR, or NMAR depending on how you write it, data: sad face, there's nothing you can do about it. And there's gonna be a bias problem and an efficiency problem, just like with unmodeled MAR data. The difference between MAR and MNAR data is basically whether we can do something about it, whether we can implement some kind of model, at least in theory, to fix the problem. So if you have MNAR data, you're hosed. If you have MAR data and you do something about it, we can make the situation better. If you have MCAR data and you don't do anything about it, it's not terrible, but not great.
But you could also do something about it to reduce the efficiency problem. So, as you probably guessed from the way these things are laid out, the sad face is not a modelable process. So I'm pretty much gonna focus on MCAR and MAR data, and on ways we can try to model that missingness to reduce the bias and efficiency problems. So let's talk about ways we could do that. Of course, missing data is not a new problem. It's existed practically as long as there's been data. But the multiple imputation methods that are gonna form the core of our discussion today are kind of new. And so it's worth asking what people did in the past and whether we expect those things they did in the past to have been any good. I think the number one thing that most people do when they have missing data, and I've certainly done it in the past, is to just take the data set and say: if any observation has any missing variable in it, drop the case. Just completely remove it from the data set. This is listwise deletion, and it is, I dare say, the most common naive method, maybe not the most naive method, but certainly the most common naive method for handling missing data. And it's implemented by default in R, for most packages in R at least. For example, if you run lm, a linear model, on data with missing values, it'll simply perform listwise deletion on any incomplete cases. Stata does it as well. And in fact it usually does so silently. It simply does it, and then at most it's gonna put a little note at the bottom saying so many observations were dropped due to missingness. R will do that. Stata won't even do that; it'll just report an N that's smaller than the N that you gave it. This quite obviously leads to the maximum loss of information, because even if I know the value of every variable except one in an observation, if I try to use that missing variable in a model, Stata and R will simply drop that entire case. I mean, you can imagine a hypothetical situation where I'm modeling Y with X, Z, and W, and I know Y and I know X and I know Z, but I don't know W. There's a lot of information in that case. Stata will just drop it. It's gonna treat that case the same way it would treat a case with four missing observations, where we truly just don't know anything. I could add an infinite number of cases that look like that to the data set and it wouldn't help. These two cases are different from one another, and Stata and R, if you don't ask them to do anything else, will treat them the same. And that's problematic. Treating data this way is gonna lead to the maximum consequences of missingness: in the MCAR case, you've got an efficiency problem; in the MAR case, you're gonna have a possible bias problem and certainly an efficiency problem as well if you simply use this listwise solution. So the main point of today's lecture is to give us some alternatives to just the listwise deletion, which is probably what we're gonna do anyway by default. Now, one step above listwise deletion is to try to replace missing values with something that we think is a good guess. That's what imputation fundamentally is. And what some people did to handle the missing data problem is a simple mean imputation. Now, a mean imputation is something like: okay, going back to my example up here, I've got a missing value for W in this case. Now, suppose I've got lots of other cases.
There are many, many other cases in this data set. And the mean for variable W is 8.5. That's the mean value for W. Well, maybe what I'll do is I'll just take that mean value right here and I'll just plug that in, 8.5. I'm gonna pretend as though the observed value of W for that case is 8.5. And I'll do that everywhere there's a missing value. That's probably the second most common naive method for dealing with missing data: just fill in the holes with pretty uninformative, but not crazy, estimates for the missing values, the means. Now, that actually will help a bit in terms of the efficiency problem. But the problem is you're gonna go from probably having estimates which are too variable to having estimates that are falsely too narrow. So what you're probably gonna end up doing in this case is underestimating the SEs. Because when you put in the mean of a variable for its missing value, you pretend as though you know for certain that the missing value is equal to that mean. Well, in point of fact, you don't know that. In fact, you probably know that it's not exactly the mean, but it's a good guess, better than no guess at all, and there's some uncertainty around it. But you're not accounting for your uncertainty in that imputation. So in short, if you were to really add in the right amount of uncertainty in the imputation, the estimates of beta that came out of your regressions using this imputed data would have bigger standard errors. So this is, again, not a crazy method, but we could do better. Another problem is that it really doesn't make much of an attempt to recover associations between the variables. So if I just plug in the mean for every missing value, say for example I've got a couple of cases, let me scroll down here a bit. I've got a couple of cases like this: I've got y, x and z, and I've got nine, missing, eight, and I've got six, four, and then missing. Now, if I just know that the mean of x is three and the mean of z is five, I'm just gonna plug in three and five right there and be done with it. But probably x and z are correlated, at least a little bit. For example, maybe if we did an analysis, we'd see that in the rest of the data set the correlation between x and z is 0.6. That's useful information. What that tells us is that when z is big, x will tend to be big, and when z is small, x will tend to be small. And likewise, when x is small, z will tend to be small, and when x is big, z will tend to be big. So we could do better in our guess than just throwing in the mean. We could maybe try to capture the association between these variables using our correlation coefficients, something like that. And this is why I also added, as a variant of this particular method, trying to draw out of the multivariate distribution of the variables. What I mean by this is we say, okay, we've got all the variables, y, x, z, and everything else in the data set. These things have some kind of distribution. Like I could plot, here's x and here's z, and I could probably try to figure out, okay, these things are distributed like this, right? And I could come up with summary statistics for them. Maybe I could say these are normal and these have means and standard deviations and so on. This is gonna allow me to do a little bit better by saying, okay, well, if I know x is here, then I can guess that z is probably gonna be somewhere in here, as opposed to the grand mean. If x is out here, I probably know that z is a bit bigger, right?
So if I use this sort of multivariate distribution of the data, I can maybe get a little bit better guess. That's good, that's good news. So there is a variant of this procedure that maybe gives me a limited attempt to recover associations between these variables. That's fine. But we could probably do even better yet, and that's when I get to this regression-based imputation. This is implemented in the impute command in Stata, which is rather old now. I'm not sure it's actually formally deprecated, but there are certainly better ways to proceed at this point, I believe. And what a regression-based imputation does is say, well, rather than guessing strictly based on the means, or maybe on some kind of multivariate distribution, what we're gonna do is say: okay, if I need to predict a missing value for x, I'm gonna say x is a function of everything else in this model. So we're gonna call all the other variables capital W, and I'm gonna estimate a regression of x on those with coefficients alpha, and my prediction for x is gonna be a function of that regression, that model I've run. So I'm literally gonna say x is like alpha zero plus alpha one w plus alpha two z, and I'm even gonna include the dependent variable y, the DV in this case. You can even put the dependent variable in, and that is not problematic. In fact it makes things better; one should do this, because this is not a causal model in any respect. We're simply trying to get the best guess for x, period. So prediction is all that matters, and if putting the DV on the right-hand side improves prediction, so be it. So you run this regression model, you estimate coefficients, alpha hats, and then you generate your x hats right there. You plug all the other variables into the model and produce predictions just like always, and that becomes your filled-in variable. Now, just like with the mean estimates, we're not really incorporating uncertainty very well, because as you probably know, these alpha hat estimates are variable. So we have uncertainty in the alpha hats that should be incorporated into uncertainty in your later model of y. So if I'm running a model of y as beta zero plus beta one x plus beta two w plus beta three z, like that, and I'm plugging in some values for x, if I'm taking these x hats and plugging them in over here, my uncertainty in how I got those x hats should filter into uncertainty in beta. And naively just running a regression, predicting x, and then plugging that back into the data set wherever there are missing x's understates my uncertainty about the x's I plugged in. So just like with the naive mean estimate, you're probably gonna have some efficiency problems here, and in fact the SEs have a likelihood of being too small, and by SEs I mean the SEs of your final model estimates, the SEs of your beta hats. So if I'm trying to model y using x, if I impute x with my independent variables but I don't incorporate the uncertainty of my imputations into the final model, my estimates for beta, beta zero, beta one, my betas are gonna be too certain, too confident. So these all make sense, they're all perfectly reasonable things to do, but I think it's safe to say that we could do better; they're suboptimal. Here's a rough sketch of what this kind of single imputation looks like in practice.
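Just to make the mechanics concrete, here's a minimal sketch of regression-based single imputation, assuming a hypothetical data frame called dat with columns y, x, z and w, where only x has missing values; this is not Stata's impute command, just an illustration of the same idea.

```r
# A minimal sketch of regression-based single imputation, assuming a hypothetical
# data frame dat with columns y, x, z, w where only x has missing values
obs <- !is.na(dat$x)

# Imputation model: x as a function of everything else, including the DV y
imp_model <- lm(x ~ y + z + w, data = dat, subset = obs)

# Fill the holes with the model's predictions (single imputation)
dat$x[!obs] <- predict(imp_model, newdata = dat[!obs, ])

# A plain mean imputation, by contrast, would ignore the other variables entirely:
# dat$x[!obs] <- mean(dat$x, na.rm = TRUE)

# The later substantive model then treats these plug-ins as if they were observed
# data, which is why its standard errors tend to come out too small
fit <- lm(y ~ x + z + w, data = dat)
```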
In panel data, there are lots of crazy things people do; well, not crazy, I shouldn't say crazy. There are lots of things people do on a sort of ad hoc basis to try to fix missing data. For example, I've seen people do something like, well, if a value is missing at time t, so let me give this some extra room here, if I've got x at time zero and x at time one and x at time two and x at time three, and these are all numbers of course, and one of these is missing, so like if this one is missing, well maybe what I'll do is I'll just fill in the average of the neighbors and put that there, so this is a linear interpolation between x one and x three. Or I've seen some people just take the previous value and plug it in right there. There is some evidence, noted in various outside references, that this technique can make your final estimates using this x variable too confident, and it can bias them. It can also not bias them, but whether it biases them or not is at least hard to assess a priori. So again, it might make things better, it might make things worse, it depends, but we can certainly do better. And I should say this way of thinking would only possibly work in time series or panel data, where you have multiple repeated observations, so we don't even have a hope of using it in cross-sectional data sets. All right, so now I've sort of laid out some things, I don't wanna say things not to do, because they'll all make things relatively better, and really the regression-based imputation's not that bad, I've used it in the past for various reasons, but there's probably a better way to do it, and so now I wanna talk about what those better ways are. So let's take a look. You may have noticed the common theme in what was wrong with some of the more naive methods of imputation, like regression imputation or plugging in the means: there's no attempt in those methods to account for the fact that our choices for imputation are uncertain. And so one of the goals of multiple imputation, which is the method we're gonna talk about right now, is that it is able to account for the fact that our choices for imputation are uncertain. And this is, in concept, a relatively easy thing to describe. All you do, quote unquote, is the following. Use some kind of model, like maybe a regression model, to predict the missing observations using the observations that are not missing. So we have some partial information; we fill in what we don't know using what we do know. But instead of just doing that one time, we do it a lot of times. And we do it a lot of times using variations on the model that we use to predict the missing values. For example, if we use some kind of prediction model that involves regression: the model we're actually interested in is y as a function of x and z, but we've got some missing values on x. So we specify, okay, I need an imputation model that makes x a function of a constant, the other independent variables, and even the dependent variable. There's a VCV matrix of the estimated coefficients in that imputation model. That VCV matrix incorporates and measures our uncertainty in the extent to which we can use z and y to plug in values of x.
So maybe what we'll do is, for example, pick a whole bunch of different values out of that distribution, using its asymptotic distribution, maybe the multivariate normal, and I'm gonna use the estimated coefficients and VCV matrix from this imputation model as the estimates of the mean and variance of that distribution. So I take m draws out of the imputation model's asymptotic distribution using its VCV matrix. Now, I've called these coefficients alpha up here, so let me say it that way: I've got alpha, and I'm gonna draw m many copies of alpha out of its asymptotic distribution, which in this case is the normal, for regression. So now I've got m many copies of alpha, and now I can predict m many values of the missing variable. If we fill in m many values of the missing variable using those m copies of alpha, now I have m data sets, each one corresponding to a different draw from the imputation model. So I'm basically not sure what to plug in for x, so I just do it a lot of times, m many times, and produce m many different data sets. Actually, the more I look at this, I really should differentiate between beta and alpha. So alpha, we're gonna call that the parameters of the imputation model. So I'm gonna fill in some of these entries up here as alpha, so that we can be sure we're actually talking about the imputation model; those are estimates for the imputation model. Now I've got my main model of y, so I'm predicting y using beta zero plus beta one x plus beta two z, but I imputed some values for x from the imputation model, alpha zero plus alpha one z plus alpha two y; that was my imputation model that I used to plug in some stuff there. I drew m many copies of my fitted alpha values from the asymptotic distribution of alpha. So I now have m many copies of the data, and that gives me m many beta estimates when I plug those m many data sets back into my original model. So I've got m many estimates of beta, one for each one of my data sets. Then what I wanna do is calculate the final estimate of beta. I wanna, in other words, combine all the betas that come out of those data sets into one final estimate of beta. And it turns out that that's relatively easy to do. All you need to do is take the mean of the m many estimates; that's your estimate of beta. And the variance of your estimate of beta is a function of W, the within data set variance of beta: this is the estimated sampling variance of your beta hat within each imputed data set, which in OLS comes out of the usual sigma squared estimate, and you just average those estimates across all your different imputed data sets to get the W part. And then B is the between data set variance, the variance of the beta hats across the imputed data sets. So this is the variance that accrues due to uncertainty in the imputation of X. And your final variance in beta is a combination of W and B: it's W plus the quantity one plus one over m, times B. So that second piece corresponds to the inflation in the standard errors of beta hat that we do to correct for imputation. In other words, that component of our final beta hat estimate's standard error is the additional noise that we introduce by trying to plug in estimates for X. And there's a reason why the formula is not simply the sum of them or the average of them, but that particular weighted sum. Here's what those combining rules look like if you compute them by hand.
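As a concrete illustration, here is a minimal sketch of Rubin's combining rules computed by hand in R; it assumes you already have m imputed data sets sitting in a hypothetical list called imputed_list and that you care about the coefficient on x in a model of y on x and z.

```r
# Rubin's combining rules by hand, assuming m imputed data sets already sit in a
# hypothetical list called imputed_list and we care about the coefficient on x
fits <- lapply(imputed_list, function(d) lm(y ~ x + z, data = d))

betas <- sapply(fits, function(f) coef(f)["x"])       # m point estimates of beta
vars  <- sapply(fits, function(f) vcov(f)["x", "x"])  # m squared standard errors

m        <- length(fits)
beta_bar <- mean(betas)            # final point estimate: the average of the m betas
W        <- mean(vars)             # within-imputation variance
B        <- var(betas)             # between-imputation variance
T_total  <- W + (1 + 1/m) * B      # Rubin's total variance
sqrt(T_total)                      # the corrected standard error of beta
```

In practice the pooling functions in the imputation packages discussed below do exactly this bookkeeping for you.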
Rubin originally developed that formula and it has sort of passed into the canon of accepted results at this point. So I'm not gonna bother proving why that's the case, but just assert that this is a known formula and that it is the proper formula for the variance of beta after imputation. This is all actually fairly straightforward to explain: draw a bunch of different copies of X, run your final regression using all those imputed copies, and then calculate the variance using the little formula here. That's fairly easy to do, fairly easy to explain. And it's actually really easy to do in practice because there are so many packages in both Stata and R that are designed to do all this for you. Such a package generates the imputation data sets using models that you specify, it'll run the final model on all those data sets, and then it'll pool the results for you, so that you just get a little table of beta hats that looks exactly like what you get out of a regular linear model. But actually, before we get to the software, I should say that this generic overview of multiple imputation describes more or less what every package will do; however, there are many, many details of how low level things are handled that differ, and there's considerable debate and discussion and new research on the exact best way to implement this generic procedure. So I'm gonna talk about some of the different ways this procedure is implemented and some advantages and disadvantages of those ways, and then get to showing you how to actually use the software. So there are two major, large scale approaches to multiple imputation that I'm gonna talk about in this video. One of them is multiple imputation by chained equations, or mice. The mice algorithm was developed by Stef van Buuren in a variety of papers, there's even a book by van Buuren on this topic, and there have been multiple collaborators along the way. Mice is nice because it's fairly simple to explain and it works. So what does mice do? Well, first you start with a data set that has some missing observations and you just toss out everything where there's no information. So in other words, if I've got a data set with y and x and z and w and q, and I don't have data on any of these variables for a case, I can't do anything. So I just have to throw out those cases. Now, this kind of makes sense, because if I didn't have to throw out those cases, that would imply that I could put in an infinite number of empty cases and somehow generate information out of them, which would be quite a discovery if one could do it, but I have my doubts as to whether that's feasible. So you have to just get rid of observations that contain no information. Then, for whatever you have left, which should be most of the data set, just start off by filling in random numbers for the missing observations. So for example, if I've got a line of this data, observation one here, and I know it's two, three, and then there's a blank, and then eight, and then four, I'm just gonna start off by putting in some random number here too. Now, it may not be a completely random number. For example, I might start off by filling in all the means, or I might start off by filling in a draw from the univariate distribution of z. So I know it's got a mean of four and a standard deviation of one; maybe I just draw from that and put it in. So in other words, my guess is not stupid.
I'm not guessing a million. But fundamentally, this is gonna be a fairly naive guess. This is not the final guess. This is just a quasi-random guess to start off with. Then what you do is the following. So I'm gonna create a fake data set here: blank, one, eight, blank, six, four, two, three, blank, one, one, four, eight, two, seven. Step two says no missing values allowed, so I fill in all the missing values here using quasi-random numbers. Step three says I wanna start with a column and move through that column. For example, the first column is y. I'm gonna pick any observations that are missing, such as this one, all right? And I'm going to impute the missing value using everything else. So I'm gonna create a model for the missing values of y using x, z, w and q. I'm gonna come up with a prediction and I'm gonna use that to update my value here. So I'm gonna come up with some new guess that's gonna be better. Now, you'll notice all of the random things I entered in here are going into this imputation model. So they might not be very good. Maybe they're great. We don't know yet. But this guess should at least be better than the random, or quasi-random, guess we made at the beginning. So maybe this changes to three. All right, fine. So I fill in anything that's missing here and then I move on to the next column. Well, there's no missing values for x, so I'm safe. Then I move on to the next column. Ah, we've got a missing value right here for z, okay? So now I say, okay, here's a missing value, so I build an imputation model for z using y, x, w, and q. So now I'm gonna run that model, get a predicted value for z, and plug that in right there. And this model uses my new value of three here, the one that I previously imputed, plus these two old values for w; they go in here too. So already we've gotten a little bit better. Our estimate for z is gonna be a little bit better because we're using this imputed value for y here, not just a random one. Now I think you can guess where this is going. Now we move to the next column, w. We build an imputation model using y, x, z, and q. We impute, using this model, these two missing values. And let's say our previous guess here for z was like 1.8. These two values, 1.8 and three, are now in the imputation model for w. So our imputation model for w is incorporating information from our previous imputations for y and z. So we're gonna get hopefully better guesses, like maybe seven and two here, for the missing values for w. Then you just repeat that process over and over again. You repeat, repeat, repeat, repeat, repeat. You can have a fixed number of repetitions, do it 20 times. You could maybe have some kind of convergence criterion, if you like, where the imputations stop moving around, whatever. The idea here, hopefully, is that by doing this over and over again, each time you repeat the cycle the imputations all get better, because they're all sort of iteratively improving on one another. At a certain point, you're gonna just say, we're done. I'm done doing this imputation, and so, bam, I create a data set out of that. Then I start over, and I repeat the process over again, and you do this whole process M many times to create M many imputed data sets. A stripped-down sketch of one of these cycles is below.
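Here's a stripped-down sketch of that cycling procedure in R, just to fix ideas. It assumes a purely numeric data frame called dat, uses plain regression predictions rather than mice's actual imputation methods, and leaves out the draws from the imputation model's uncertainty that the real algorithm adds.

```r
# A stripped-down sketch of one chained-equations run on a purely numeric data
# frame dat; real mice also draws from the imputation model's uncertainty (or uses
# predictive mean matching), which this sketch leaves out
miss <- is.na(dat)                 # remember where the holes are

# Step two: quasi-random starting values -- here, each column's observed mean
for (v in names(dat)) dat[miss[, v], v] <- mean(dat[[v]], na.rm = TRUE)

# Steps three and four, repeated: cycle through the columns, re-imputing each one
# from all the others using the current filled-in values
for (iteration in 1:20) {
  for (v in names(dat)) {
    if (!any(miss[, v])) next
    form <- reformulate(setdiff(names(dat), v), response = v)
    fit  <- lm(form, data = dat[!miss[, v], ])
    dat[miss[, v], v] <- predict(fit, newdata = dat[miss[, v], ])
  }
}
# Running this whole routine M times from different starting draws would produce
# the M imputed data sets
```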
Now, there are two sources of variability in this imputation process. One of them is that the random seeds you started with at the beginning might sort of lead you in a certain direction in terms of the final values you arrive at from the imputation. Another source of variability is that, if your imputation model is something like regression, and you're not just picking the maximum likelihood prediction but choosing out of the distribution of all the possible predictions, then you've got variability that accrues due to the uncertainty in the imputation model. So your data sets, once you've done this M times, shouldn't all look like each other. They won't look like each other, partly because your seeds, your starting values for all the missing values, are a little different, but also because there's uncertainty in your imputation model, and the mice process, when it fills in the values, is incorporating that uncertainty. Now, I mentioned that you're not always just picking the one value for imputation. There are actually lots of different ways to impute variables using the mice algorithm, and the differences are all pretty much in steps three and four: what method do you use to do the imputation, and how do you choose the imputed estimate from that method? So one way of doing it is good old regression. Just like I mentioned up here, you run models for each variable with missing observations and then generate predictions out of them. Regression is an obvious way to do that. It could be logistic regression if W is binary, for example, or multinomial regression if W is categorical, or linear if W is continuous, or Poisson if the missing variable is a count. That's fine, makes sense. And you can either pick the maximum likelihood prediction that comes out of that, or you can say, well, there's a distribution of predictions; in this case W is the thing I'm imputing, so there's a distribution of W hats that come out of this, and I'm just gonna pick one, such that the most likely ones get picked the most and the least likely ones get picked less, but there's still some variability in that. So either I pick just the maximum likelihood W hat or I sample out of the distribution of W hat. The default, actually, is not regression for mice, at least as implemented in the mice package in R. The default is something called predictive mean matching. Predictive mean matching is the default in mice for continuous variables, and what it involves is calculating the predicted value for a missing variable from the regression model, and then choosing the three cases (and you can actually change whether it's three, or five, or two, or whatever; pick some number of cases, the default is three) that have the closest predicted value in terms of Euclidean distance, and then randomly choosing one of those three cases' observed values to impute. So for example, if I've got a little fake data set down here, suppose I've got Y, X, Z, W, and I've got some numbers here: four, two, one, eight; six, blank, three, one; two, four, eight, seven; nine, four, one, three; two, eight, nine, six; something like that. Okay, I impute using a model of X on Y, Z, and W, I create a predicted X hat, and my X hat for this observation where it's missing is, let's say, two, okay? So I plug a two in there.
Well, not quite. What I do is I say, all right, my X hat for this complete case would be one, for this case it would be 1.6, for this case it would be eight, and for this case it would be six. So what predictive mean matching does is say: okay, the three closest X hats to my prediction of two here are one, 1.6, and, oops, eight's not among the closest, six is, hold on, here we go: one, 1.6, and six. Those correspond to observed X's of two, four, and eight, okay? I roll a three-sided die (three-sided dice don't exist, but suppose they did), one of those numbers comes up, and I plug that number in. Suppose I pick this case right here. What I do is plug in four, the observed value for that case. So what I'm doing is using real data to fill in the missing data, but I'm using my imputation model to pick which real value I fill in. That's the general idea. So pick the three cases that have the closest predicted values, and then randomly choose one of them. This is a way of hopefully not relying so much on the quality of the imputation model. The imputation model gives you candidate cases, but the actual data come from the data set itself. Here's a small sketch of what that matching step looks like.
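A rough sketch of the predictive mean matching idea, again using a hypothetical data frame dat with columns y, x, z and w where only x has missing values; the real mice implementation also perturbs the regression coefficients before matching, which is skipped here.

```r
# A rough sketch of predictive mean matching on a hypothetical data frame dat
# (columns y, x, z, w) where only x has missing values
missing_rows <- which(is.na(dat$x))
donor_rows   <- which(!is.na(dat$x))

pmm_model   <- lm(x ~ y + z + w, data = dat, subset = donor_rows)
xhat_donors <- predict(pmm_model, newdata = dat[donor_rows, ])
xhat_miss   <- predict(pmm_model, newdata = dat[missing_rows, ])

for (j in seq_along(missing_rows)) {
  # the three donors whose predicted x is closest to this case's predicted x
  nearest <- donor_rows[order(abs(xhat_donors - xhat_miss[j]))[1:3]]
  # fill in one of their actually observed x values, chosen at random
  dat$x[missing_rows[j]] <- sample(dat$x[nearest], 1)
}
```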
I have to say there are many different ways of doing this; I've only very briefly hit on two. For each of these two ways there are numerous elaborations and variations that are possible, and there are other methods besides, and both of these tweaks sit inside the larger approach of mice, multiple imputation by chained equations, which is itself not the only approach. So by no means am I telling you the right way to do it. I'm just giving you a way, and this is a way that in Monte Carlo studies has been shown to be pretty good, pretty reliable, at least when data are MAR, but it's not the only way. Another way that's popular is of particular interest to scholars using Bayesian modeling, and so I wanna talk a little bit about Bayesian data augmentation as an alternative to mice that might make a lot of sense if you're already using a tool like WinBUGS or JAGS to run your model. So let's talk about Bayesian data augmentation. The idea of Bayesian data augmentation stems from the insight that multiple imputations, such as those implemented in the mice algorithm, are a form of Markov chain. You start with a bunch of values for all the different imputations you're gonna make, and then you update each imputation based on the state of the rest of the imputed values. So as we saw in the previous notepad, your imputation for a variable W includes imputations for X and Z and Y and all the other variables that are being used to generate the imputation model for W. So if I've got variables W, X, Z and Y, and I'm going to update my imputations multiple times, when I start off (really there needs to be a zero here) let's say these are the initial values that I've plugged in randomly for each one of these variables. When I go to create my first real imputation for W, I'm going to use all of these values that I originally put in to generate that value. And then when I go to impute X here, I'm going to use this and also these. This is exactly the way a Markov chain gets updated using a Gibbs sampler: you use all the previous states of the Markov chain, plus any updates you've already made on this particular iteration, to create the new link in the chain for whatever variable you're currently interested in. Well, if you accept the idea that this imputation process can be thought of as a Markov chain, notice that we are already building Markov chains as a part of Bayesian estimation. We use Markov chains to draw samples from the posterior in order to derive inferences about that posterior. So why don't we build a missing data model right into the Bayesian model that we're already estimating? In effect, we just treat the missing observations as more parameters. So instead of working with f of beta given the data, we now work with f of beta and the missing values given the data we can actually observe. And that, just like everything else, is proportional to the distribution of the observed data conditional on beta and the missing values, times some prior on beta and the missing values. There's a proportional relationship between those two quantities according to Bayes' rule. So all we're gonna end up doing with this augmentation process is sampling not only beta, but also the missing values, out of the posterior using Markov chain Monte Carlo. And then, to come to some inference about beta, we're just going to integrate out the missing values. And we're gonna integrate out the distribution of the missing values by summing over them, the way we would sum over any other incidental parameter that we weren't interested in. So for example, our Markov chain at the end of this is gonna have a bunch of draws for a missing value and a bunch of draws for beta. You can think of this Markov chain as just being a really long list of values that we've sampled. Now, we don't really care about the joint relationship between these two things. We don't really care about this, in other words; all we care about is this. So that's easy to do. For example, if we wanted to know the mean of beta, we would just take the mean of all these draws, and that would give us the mean of beta integrating out y-miss. If we wanted to know the density of beta at some particular value, beta zero, what we would do is just take all the draws where beta equals beta zero and ask what fraction of the total they are. Now, we might have multiple values of y-miss at these different draws, but we don't care. We're just summing over them to integrate them out. In essence, what we're doing is just ignoring the information about y-miss and drawing inferences about beta. This is, like I said, the exact same thing we always do with a Markov chain when, for example, we want to get the marginal density of one beta estimate. So very often we have a beta zero and a beta one and a beta two, and we don't necessarily care about f of beta zero and beta one and beta two together. And we don't even really care necessarily about the conditional estimate of beta zero given beta one and beta two. We just want to know what f of beta zero is. And the way we would do that is just to look at the samples of beta zero and, you know, create a kernel density estimate or take the mean or whatever it is we're interested in, and ignore the samples of beta one and beta two. Just concentrate on the samples of beta zero. That integrates out beta one and beta two in creating that density. The same thing happens in the case of missing data: we just impute it, those are parameters we've estimated, and then we ignore them. We just don't care about them. If you're already using a tool like JAGS or WinBUGS, this falls out almost automatically, as in the sketch below.
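Here is a minimal sketch of that idea using JAGS via the rjags package; the model, priors, and variable names are illustrative assumptions, not the lecture's own code. The key point is that any x[i] that is NA in the data, but has a distribution in the model, simply gets sampled as another parameter, and monitoring only the betas integrates the imputed values out.

```r
library(rjags)

model_string <- "
model {
  for (i in 1:N) {
    y[i]  ~ dnorm(mu[i], tau)
    mu[i] <- beta0 + beta1 * x[i] + beta2 * z[i]
    x[i]  ~ dnorm(mu_x, tau_x)   # gives any missing x[i] something to be drawn from
  }
  beta0 ~ dnorm(0, 0.001)
  beta1 ~ dnorm(0, 0.001)
  beta2 ~ dnorm(0, 0.001)
  mu_x  ~ dnorm(0, 0.001)
  tau   ~ dgamma(0.01, 0.01)
  tau_x ~ dgamma(0.01, 0.01)
}"

# y, x, z here are hypothetical vectors, with NA entries in x for the missing values
jm <- jags.model(textConnection(model_string),
                 data = list(y = y, x = x, z = z, N = length(y)),
                 n.chains = 2)
post <- coda.samples(jm, variable.names = c("beta0", "beta1", "beta2"), n.iter = 5000)

# Monitoring only the betas integrates the imputed x values out of the inference
summary(post)
```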
As long as the data are really missing at random, and the fact of missingness, in other words the occurrence or absence of the data itself, is not related to beta, which is sometimes called the ignorability assumption (although with ignorability, my experience has been that there are multiple definitions in the literature which aren't all the same, but whatever; the van Buuren book calls that the ignorability assumption, or maybe it's somewhere else I read it), this works fine. In other words, if we don't have to account for the pattern of missingness, this is a perfectly reasonable thing to do. If we need to account for the pattern of missingness as well as the particular values of the missing variables, then we need to create a missingness model as well as a y-miss model. So a binary variable, say r, equals zero if the variable's not missing and one if it is; you model that, and then you model f of y-miss given r. So if r is one, what is the value of y? That's a so-called missingness model, and we could model that missingness explicitly if the ignorability condition didn't hold. But if it does hold, we don't need to bother with that. We can just impute y-miss directly using some kind of regression model. Now, I think that the best way to understand these various procedures is to actually see them implemented in R or in Stata, see the actual nitty gritty of what they're doing, and look at some results. That's, in my opinion, the best way to understand them. So, we've got a theoretical background and some basic description of what's going on in this process. Now let's see it in action. All right, so I've opened R. This is the lecture 10 file for today. And I'm gonna use the mice library, which implements the multiple imputation by chained equations approach that we discussed in the lecture. And what I'm gonna do is start off with a really basic MCAR regression model. I build a data generating process using the full data set, but then I'm gonna say, okay, suppose I can't actually see what the full data set is: 10% of each variable is missing. So what I've done is basically drawn a miss variable, and I'm gonna replace x, y, and z with an NA value if those things are missing. So my final x, y, and z variables have missing data in them. This is the dat data frame here. So if I look at what dat looks like, here's x, z, and y, and I've got some missing values, like so. Now, these missing values are missing completely at random. How do I know that? Well, here's where the missingness was induced. It's missing completely at random: I'm just literally flipping a coin that comes up one 90% of the time and zero the other 10% of the time, and if it comes up zero, I drop the value. So this is a very mild form of missingness. It's a missing completely at random case. The mice package allows me to create these multiple imputation data sets, and the mice command is what does it. And I can just call up the help here so you get a sense of what the particular syntax of this package looks like. The basic things you need are a data frame that has some missing values and the number of imputation data sets you wish to create; five is often arrived at as the minimum number that you need. Something approximating this setup in code is sketched below.
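An approximation of the setup in the lecture file follows; the exact script isn't reproduced here, so treat the seed, the sample size, and the object names as stand-ins.

```r
library(mice)
set.seed(123)

n <- 200
x <- rnorm(n)
z <- rnorm(n)
y <- 0 + 1 * x + 1 * z + rnorm(n)            # true DGP: slopes of 1, intercept of 0

# MCAR missingness: each value independently has a 10% chance of being dropped
drop10 <- function(v) ifelse(rbinom(n, 1, 0.9) == 1, v, NA)
dat <- data.frame(x = drop10(x), z = drop10(z), y = drop10(y))

# Create five imputed data sets, cycling through the variables 20 times for each
mi.dat <- mice(dat, m = 5, maxit = 20, printFlag = FALSE)

# Diagnostics: blue is the observed data, red/pink are the imputations
bwplot(mi.dat)
densityplot(mi.dat)
xyplot(mi.dat, x ~ z)
```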
One of the surprising findings of Rubin's research is that you actually don't need to impute very many data sets in order to get a proper estimate of beta given the missing data. Five is often considered sufficient. With excess computing power, you can do better by estimating more. I'm gonna stick with five, but if we were really serious about this, maybe we would want to use 10 or 20. And then the maxit argument tells mice to repeat the multiple imputation cycle 20 times for each variable. So we're gonna go through the imputation cycle 20 times before we say we're done and spit out a multiple imputation data set. So in essence, maxit times m is the total number of times we're gonna run through this mice algorithm. Fortunately, this happens pretty fast on a reasonably modern computer, so it's already done. And now mi.dat is a collection of five multiple imputation data sets. We can visually assess how we're doing using various tools. The bwplot function creates box and whisker plots of the distribution of each variable in the original data set and in the multiple imputation data sets. So the blue is the original data set and the pink are each of the imputation data sets. And ideally, these should look pretty similar. And in fact, they do: the multiple imputation data sets for x, z and y all pretty much look the same as the original data set. Note we are imputing the dependent variable, which is gonna be y, along with the independent variables x and z. That is an acceptable practice, and not only acceptable but recommended; it will improve performance. And it won't overstate the certainty in our estimates, because, as we already noted, the uncertainty in our imputations is built into the process. So we're gonna incorporate that uncertainty into our final estimates of, say, the relationship between x and y. The densityplot function is the same basic idea as the box and whisker plot, but using kernel density estimates instead of simple box and whisker plots. So we've created kernel density estimates for the original data set, which is this solid blue line, and then for each of the imputation data sets, which are the red lines, and the kernel densities are estimated for all the variables that have been imputed, which in this case is x, z and y. And what you can see here is, these things should look the same, and they pretty much do. The densities of the original data set look pretty much like the densities of the imputation data sets. Nothing too crazy seems to be going on. The xyplot is pretty much a straightforward way of plotting the original observations in the data set along with the filled in observations. So zero is the original data set, which consists of only non-missing observations. One, two, three, four and five are the five multiply imputed data sets that I created, and the red dots are the filled in dots; in other words, they're the imputed observations for each case. And what we'd be looking for in this plot is some evidence that the imputed data look really different from, and fall outside the range of, the original observations. We'd also maybe be looking for some kind of systematic pattern, like, for example, all the imputed values for x being big, or all close to one, or close to 0.5, or close to zero; all small, all medium sized, something like that.
In this case, what you can see is that the imputed data look like they're pretty spread out across the space, which is what we'd like to see if the data are truly missing completely at random and the imputation procedure is doing its job; that's what we would expect to see from MCAR data, which this is. Now, what we wanna do is take this imputed data set and fit a model. The with command allows us to implement a model with the multiply imputed data sets. So what it does is recognize that we've got five different data sets, and we wanna run this linear model on each one and then combine the results in some way. So the fit object is where these results are being stored. You can see there are five different models corresponding to the five different imputation data sets. The pool command allows us to lump all the results together in a way that's consistent with Rubin's combination formula, and then we can summarize those results to get a nice regression table showing us what we've got. So these are the results that come out of our multiply imputed data sets. The estimates are the beta coefficients on x, on z, and the intercept, and what we should be getting based on our DGP is ones for x and z and a zero for the intercept. And we're not too far off of getting that, and the t statistics for x and z are highly statistically significant. The intercept, I should add, is not statistically significant, which is what we would expect to see. Sorry, I've got the yawns this evening. Anyway, that's all sort of within what we would like to see. And what we might wanna do is compare these multiple imputation results with the naive results of listwise deletion. So here's where I run the model on the original data set with listwise deletion, and here's where I run it with the imputed data. And what we can see is that the z coefficient gets closer to what we would expect it to be: in other words, it goes from 0.6 in the naive listwise deletion case to 0.7, which is closer to the true value of one. The intercept goes from 0.2, which is further away from its target of zero, down to 0.11; in other words, it also gets closer to its true value. And it goes from being marginally statistically significant to very firmly statistically insignificant. So in other words, with MCAR data, multiple imputation works. It makes our results closer to what we would expect given the DGP, which makes sense because the MCAR condition is met, and that's the most favorable condition for imputation. Now, I have to say that if we did this many, many times, we would expect that the naive results would be centered on the true values, but the distribution of those results would be wider than the results that we would get out of multiple imputation, because listwise deletion with MCAR data is inefficient but not biased. That wouldn't necessarily be the case with MAR data, where a failure to model the missingness would actually be more consequential and could cause bias. In other words, the naive estimates may well not be centered on the true values, whereas we would expect the imputation-based estimates to be unbiased. All right, so now I'm gonna repeat this process, same as before, same model and everything, but I'm going to include more missingness. So up here, I had a 90% chance of retaining each observation. Down here, I'm only gonna have a 50% chance of retaining each observation of the sample.
So if I bind all of these together, the dataset should be of size 200, but if I actually run a naive regression, 182 observations are deleted due to missingness, because each variable has a 50% chance of being dropped, which means that at least one of them is missing for most rows. So this is a rather severe missing data problem, with the large majority of observations incomplete. We might expect mice not to do so well in this case, because we're trying to extract a lot of information from very little data. But let's see how well we can do.

So I bind all this data together into a data frame and create my multiple imputation datasets, and there's a little more going on here: maybe I'll up this to 40 iterations and do 10 datasets, just to get a better sense of the variability. If I get mice going here, it runs really quickly. While this runs, one reason it was once so important to keep m down to five or six is that when these ideas were first developed in the 1980s, it could take a long time to create the multiple imputation datasets, and it was quite annoying to do so. But with modern computing, as you can see, I'm cranking out 10 datasets with 40 iterations each in almost no time; it's already done. So there's no reason, especially for a final published analysis, not to do a bit more imputation as long as you've got a reasonably new computer.

My box-and-whisker plots look fine. If I do the xyplots, you can see I'm now plugging in lots of observations, because there's 50% missingness, but fortunately they look pretty much like the original dataset, which is good news. Incidentally, I'm leaving y out of the xyplot here, but I could easily show it as well. Yes, that looks fine too. My variables all pretty much look the way the original dataset looks, my imputations aren't doing anything nuts, and there are no systematic patterns in the imputations, such as all the imputed observations being crowded at one end or the other. So I'm reasonably certain my imputation model is doing its job.

Now if I fit my model with imputation and create a summary, the results are actually pretty good: x is pretty close to one, z is pretty close to one, both are statistically significant, and the intercept is statistically insignificant and pretty close to zero. If I run the naive regression with listwise deletion, we actually luck out in this case: it's also pretty close to the right answer. My x coefficient actually overshoots and ends up a bit further from its true value than the listwise-deletion estimate, my z overshoots as well, and my intercept gets closer to where it should be. Again, this isn't too surprising: with MCAR data, listwise deletion is inefficient but unbiased, so when you use a tool like mice you're primarily fixing efficiency problems rather than bias. It makes sense that mice doesn't do quite as well with this radical degree of missingness as it did before, but it doesn't do terribly, and on average it's probably still going to be a little better, in many instances quite a bit better, than listwise deletion.
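Here is a sketch of that heavier-missingness version under the same assumed DGP; the seed and object names are placeholders rather than the values used on screen.

```r
library(mice)

# 50% chance of losing each value; only about 0.5^3 = 12.5% of rows stay complete
set.seed(5678)
n <- 200
x <- rnorm(n); z <- rnorm(n)
y <- x + z + rnorm(n)
dat50 <- data.frame(x, z, y)
dat50[] <- lapply(dat50, function(v) { v[runif(n) < 0.50] <- NA; v })

sum(complete.cases(dat50))   # how few rows survive listwise deletion

# More imputations and more cycles, since computing time is no longer a constraint
mi.dat50 <- mice(dat50, m = 10, maxit = 40, seed = 5678, printFlag = FALSE)

summary(pool(with(mi.dat50, lm(y ~ x + z))))   # imputation-based estimates
summary(lm(y ~ x + z, data = dat50))           # listwise-deletion estimates
```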
Now let's talk about a case where we have an interaction variable. I'm creating a dataset here where y is a function of x, z, and also x times z. In other words, the marginal effect of x on y depends on the value of z, and the marginal effect of z on y depends on the value of x, so I've got a case of interaction, and there's 10% missingness in x, z, and y. Now, you may have noticed that xz is a variable constructed from x and z; it's not its own independent variable. So when we go to impute xz, it may not be the best idea to model it as a function of y, x, and z using predictive mean matching or regression or whatever method we might otherwise use. Rather, it would be better to impute x, impute z, and then multiply the imputed value of x by the imputed value of z to get xz. This makes more sense because, for example, if you try to impute x and xz separately, you can very easily end up with an imputed value of xz that is not equal to the imputed x times the imputed z. In other words, if I don't know x but I do know z, then obviously I don't know xz, but I'm not going to fill in x and xz separately; I'm going to fill in x and then multiply it by z. That's what I want to do.

We can do this in mice pretty easily. You start by creating a mice object that doesn't actually impute anything; in other words, we're just initializing the process. Set maxit to zero and m to one to create an empty mice object. A mice object has several components associated with it, and one of them is meth, which corresponds to the method of imputation for each variable. If we print the meth component of this initialized object, you can see that it lists x, z, xz, and y as having the method pmm, which corresponds to predictive mean matching, just like we talked about in the lecture. But we don't want to impute xz via predictive mean matching; we just want to fill in x and z with predictive mean matching and then multiply those two things together to get xz. So what we're going to do is change the meth entry for xz to the formula x times z. The tilde here tells mice to implement this formula; this is so-called passive imputation: take already-imputed variables and do something with them to create this variable. The I() tells it that the expression should be evaluated literally, meaning it's a series of operations executed by R, and the operations we're executing are to take x and multiply it by z. So we've changed the method for xz to passive imputation: we're going to multiply the imputed value of x by the imputed value of z rather than directly imputing xz.

The next thing we need to do is change the predictor matrix. The predictor matrix, or pred, of a mice object tells you which variables are used to impute which other variables: each row variable is imputed using the column variables that are marked. So, for example, the first row is x, and to impute x we use z, xz, and y; to impute z we use x, xz, and y, and so on. But I actually don't want to use xz to impute x or z, because xz comes from x and z; it doesn't predict them.
So what I'm going to do is set the x and z rows of the xz column equal to zero. Now, as you can see right here, x no longer uses xz to impute itself; it only uses z and y. Similarly, z only uses x and y to impute itself. We could also go through the formal process of setting the y entry in the xz row to zero, but we don't really have to worry about that, because we've already given xz a formula, x times z, that doesn't involve y. Having y in the predictor matrix row for xz is technically unnecessary, but it isn't going to change your results at all.

Now that I've done that, I actually want to impute the datasets. I'm just going to use five and 10 as the m and maxit here, and I'm going to pass the meth and pred objects I just created into the mice call. So I'm now telling it: use passive imputation for xz, and don't include xz as a predictor when imputing x and z separately.

How have I done? Well, my box-and-whisker plots all pretty much look like the original, unimputed dataset. My density plots look pretty reasonable too. For example, this skewed shape here: all the imputed datasets look skewed on xz, and that's actually good; if the xz distributions looked crazy, that would be cause for concern. For my xyplot I want to look at x and z, and then also xz and y, so I'll come in here and add those. They all look pretty good. One thing I notice is that the imputed y values are somewhat clustered around the center. Of course I know nothing systematic is going on because I created this data, but ordinarily that might indicate some kind of systematic process in the missingness of y that I'd at least want to think about, to make sure I've modeled it correctly so that the data are MAR rather than NMAR. And most of the imputed xz values are small, which makes some sense because xz is a multiplicative variable, so it has the skewed distribution we saw here: most of the imputed values are small, but so are most of the observed values. So that's not terribly concerning.

So what do we get if we estimate naive and imputed models with our interaction term? Well, all of the coefficients should be equal to one except the intercept, which should be zero. It looks like we actually don't get terribly large differences between them. The intercept is a little closer to where it should be; x and z are actually a little further from where they should be, but their standard errors are a bit smaller, so that's good news, I suppose, for statistical significance. The xz coefficient shrinks, which is a little problematic, but its standard error also gets smaller, which I suppose is good. Of course, if we did this over again with a different dataset, we might get slightly different answers. So if I don't reset the seed but just repeat all of this with fresh data, let's see what happens. Interesting: the estimates actually end up a little further away. Let's see what happens if I do it again with yet different data. Now they're getting closer; all the coefficients are closer to one or zero. If I do this once more, they're again closer to where they should be: the x and z coefficients are a little too low in the naive dataset and the xz coefficient is too high.
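Here is a sketch of that whole passive-imputation setup in mice; the column names x, z, xz, and y, the DGP, and the seed are assumptions for illustration.

```r
library(mice)

# Hypothetical interacted DGP with 10% missingness in x, z, and y
set.seed(42)
n <- 200
x <- rnorm(n); z <- rnorm(n)
y <- x + z + x * z + rnorm(n)
dat.int <- data.frame(x, z, y)
dat.int[] <- lapply(dat.int, function(v) { v[runif(n) < 0.10] <- NA; v })
dat.int$xz <- dat.int$x * dat.int$z   # NA whenever x or z is missing

# Initialize an empty mids object just to grab the default method and predictor matrix
init <- mice(dat.int, m = 1, maxit = 0)
meth <- init$method
pred <- init$predictorMatrix

# Passive imputation: derive xz from the imputed x and z instead of imputing it directly
meth["xz"] <- "~ I(x * z)"

# Don't let the derived xz feed back into the imputation of x or z
pred[c("x", "z"), "xz"] <- 0

mi.int <- mice(dat.int, method = meth, predictorMatrix = pred,
               m = 5, maxit = 10, seed = 42, printFlag = FALSE)

summary(pool(with(mi.int, lm(y ~ x + z + xz))))   # pooled imputation estimates
summary(lm(y ~ x + z + xz, data = dat.int))       # listwise-deletion comparison
```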
They're all closer to where they should be. So you can see that in repeated samples we generally do better, though there are some samples where we might do a little worse. If we were concerned about this performance, what we might do is create more imputed datasets and run a larger number of iterations per dataset and see whether that improves the performance of the imputation model. So I'm going to go back to the original dataset, reset the seed in other words, and see whether doing more iterations and more imputed datasets actually improves performance. Let's break for a moment while this runs.

Okay, we've let that run. The naive estimates are the same as they were before, and the imputation-corrected estimates are also pretty close to where they were. The differences are a little smaller, which we would expect, and the standard errors are still smaller, which is good. Anyway, that's how one would do imputation on a model with an interaction term.

Now, what if we had a binary independent variable? The same thing actually goes for a binary dependent variable. What I've done is recreate this dataset, but instead of being continuous, z is now binary: a factor variable with two levels, zero and one. If I induce 10% missingness in this variable and then create the dataset, you can see that mice handles it without any extra work: the mice package will automatically detect that this is a binary variable and adjust its method accordingly. If you look at the mi.init command here, which just initializes the mice object without actually doing any imputation, and look at the method it's going to pick, you can see that for z it has picked logreg, which is short for logistic regression. So instead of doing predictive mean matching based on a continuous regression, it's going to impute z using a logit model. mice will automatically detect the structure of the variable and adjust accordingly, but only if the variable is a factor. If it's just stored as a zero-one numeric, it won't detect that; it will decide the variable is continuous and use predictive mean matching. So if you want mice to automatically pick up the type of your variable and adjust its imputation method accordingly, make sure your dichotomous or polychotomous variables are identified explicitly as factors rather than as numeric variables. Alternatively, you can leave them as numeric variables, but then you'll have to manually set the method to logreg, or to polyreg if you want multinomial modeling for discrete multi-valued variables.

So we can just go ahead and do our usual thing here: impute the data and run regressions with and without the imputation. And you can see the results are actually pretty similar in the two cases, which is, I guess, encouraging. The intercept is smaller with imputation, which is what we would expect given that the true intercept is zero, but actually both estimates are too big, which is kind of interesting, and both show up as statistically significant, which is not too encouraging. But the x coefficient is a little smaller and the z coefficient is a little bigger, and both of those are movements in the direction we would like given the pattern of missingness.
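A sketch of the binary-predictor case; the data-generating details are again assumptions, but the key point is that z must be coded as a factor for mice to pick logreg automatically.

```r
library(mice)

# Hypothetical DGP with a binary z; coding z as a factor is the important part
set.seed(99)
n <- 200
x <- rnorm(n)
z <- factor(rbinom(n, 1, 0.5))
y <- x + (z == "1") + rnorm(n)
dat.bin <- data.frame(x, z, y)
dat.bin$z[runif(n) < 0.10] <- NA      # 10% missingness in the binary variable

# Initialize only, to inspect the methods mice will choose
mi.init <- mice(dat.bin, m = 1, maxit = 0)
mi.init$method    # z gets "logreg" because it is a two-level factor;
                  # a numeric 0/1 column would default to "pmm" instead

mi.bin <- mice(dat.bin, m = 5, maxit = 10, seed = 99, printFlag = FALSE)
summary(pool(with(mi.bin, lm(y ~ x + z))))    # pooled imputation estimates
summary(lm(y ~ x + z, data = dat.bin))        # listwise-deletion comparison
```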
Okay, so that's how you impute data in R using the mice package, which implements multiple imputation using chained equations, across a variety of different data structures. All right, the last bit for today's video is about Bayesian data augmentation, and what I'm going to do is return to an example we talked about in an earlier lecture. This is data from a paper that Gina Terillo and I wrote on the relationship between women in government and corruption. The basic model, which you can see in the BUGS file, let me open the model without imputation first. Okay, here's the model without imputation. This is a pretty basic model where WBGI corruption is modeled as a combination of effects from women in government (the total number of women in government) and polity. The i suffix here indicates that the women in government variable has already had its missing values filled in, I believe with a simple impute command, before it entered this dataset. There's also a random effect on region.

If I run this model without any kind of imputation, changing the working directory to the source file location first, there we go, WinBUGS is going to estimate this model for me and produce coefficients for the relationship between women in government and corruption and between polity score and corruption. In particular, what we're interested in is how the effect of women in government differs at different polity scores. You may have noticed in the BUGS file that the total coefficient on women in government is a combination of B1 and D1, where B1 is the baseline relationship when polity is zero, and the D1 coefficient allows the women in government coefficient to change for different values of the polity score. Ultimately what we're going to come up with is this plot of the relationship at different values of the polity variable. In fact, I might change this up a little so it's clear: the x-axis gives us each polity score (that should be xlab for the label, not xlim), and the y-axis is the change in corruption given a one-percentage-point increase in women in parliament. What you can see is that for small values of the polity score there is no relationship, or at least no statistically detectable relationship, between women in government and corruption, but for democracies there is a positive relationship: more women in government means higher scores, and since this is really a cleanliness measure, bigger values mean less corruption.

So now what we want to do is think about an analysis where we don't rely on the values filled in by that simple impute command in Stata, which is where that i comes from, and where we also impute missing values for polity and for the corruption score. What I'm going to do is add a multiple imputation model to this model. The imputation model is featured up here, right there. What I'm doing is creating a matrix called miss that consists of two vectors, the vector of women in parliament and the vector of polity scores, and I'm supposing that those variables are multivariate normally distributed with some mean vector and some precision matrix (the inverse of the variance-covariance matrix).
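To make that quantity of interest concrete, here is a hedged sketch of how one could build the marginal-effect plot from posterior draws. The vectors b1.draws and d1.draws are hypothetical posterior draws of B1 and D1 pulled out of the WinBUGS output, not objects from the actual replication code.

```r
# polity takes integer values from -10 to 10
polity.vals <- seq(-10, 10, by = 1)

# One row per posterior draw, one column per polity value: b1 + d1 * polity
effect.draws <- sapply(polity.vals, function(p) b1.draws + d1.draws * p)

effect.mean <- colMeans(effect.draws)
effect.ci   <- apply(effect.draws, 2, quantile, probs = c(0.025, 0.975))

plot(polity.vals, effect.mean, type = "l",
     xlab = "Polity score",
     ylab = "Change in corruption (cleanliness) per 1% more women in parliament")
lines(polity.vals, effect.ci[1, ], lty = 2)   # lower 95% bound
lines(polity.vals, effect.ci[2, ], lty = 2)   # upper 95% bound
```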
And I'm going to impose pretty diffuse priors on the means for women in parliament and the polity scores, and I'm also going to assume a fairly diffuse Wishart distribution for the precision matrix governing the relationship between these imputed variables. Now, I could perhaps leave it at that, but I'm adding this extra bit because I know that women in parliament and polity are bounded variables. Women in parliament is bounded between zero and one: it can't be less than 0% or more than a proportion of one, or 100%. The polity variable is bounded between negative 10 and 10. I could actually go even further and force the imputed polity values to be discrete, because polity is a discrete variable that takes the values negative 10, negative 9, negative 8, and so on; there's no such thing as a 9.2 polity score. But I'm neglecting that particular complication for this model and allowing it to impute fractional values of polity. I do, however, want to bound the imputed values between negative 10 and 10, the way the scale is designed, so I've added this max and min business here just to bound the imputed variables.

And that's all I do. I've basically modeled these variables as being multivariate normally distributed with some common means and some precision matrix. This corresponds to the idea of drawing the missing values from a multivariate normal distribution: I'm allowing there to be covariance between the two partially missing variables, and I'm allowing them to have non-zero means, so I'm capturing some of the extent to which polity and women in parliament go together, but I'm not doing it in quite as complicated a fashion as a fully fledged regression model. In a bivariate case like this, where two variables have missingness, the difference between multivariate normal imputation and a fully fledged regression is going to be minor, because the correlation rho plays essentially the same role as beta in a two-variable regression, say a regression of women in parliament on polity. So we're not going to gain a whole lot by using a fully fledged regression model, though we could gain some. In fact, one thing I didn't do is include the dependent variable in this imputation model, so I'm losing a little bit of information there.

Now, you might be asking yourself: the dependent variable has some missing values as well, so where are we imputing those? Well, it turns out that all you need to do to make a Bayesian model impute a value is to make that value stochastic. Here's the dependent variable, the WBGI corruption score: it's normally distributed with a mean and a precision, and here's the model that we originally specified, with a random effect at the end, the effect of polity, and the effect of women in parliament, except that I'm now using the imputed scores for women in parliament and polity as opposed to the raw scores like I did before. The dependent variable is imputed automatically when it's drawn out of its distribution. In other words, missing values in the dependent variable are imputed simply as part of estimating the Bayesian model, at least when you do it in JAGS, WinBUGS, or OpenBUGS, so I don't have to think about a separate imputation model for the dependent variable; that happens as part and parcel of the process.
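Here is a rough BUGS-style sketch, written as a model string you could save to a file from R, of the kind of joint imputation-plus-analysis model being described. The variable names (corrupt, miss, region), the priors, and the exact interaction structure are assumptions, not the actual model file from the paper; note also that it is WinBUGS in particular that handles the partially observed multivariate normal node used here.

```r
mi.model <- "
model {
  for (i in 1:N) {
    # Joint imputation model for the two partially observed predictors;
    # WinBUGS fills in the missing components of miss[i, ] while sampling
    miss[i, 1:2] ~ dmnorm(mu.miss[], prec.miss[, ])

    # Bound the imputed values to the scales of the variables
    wip.i[i]    <- max(0, min(1, miss[i, 1]))
    polity.i[i] <- max(-10, min(10, miss[i, 2]))

    # Analysis model: the women-in-parliament effect is b1 + d1 * polity;
    # missing corrupt[i] values are imputed automatically because the node is stochastic
    corrupt[i] ~ dnorm(mu[i], tau)
    mu[i] <- b0 + (b1 + d1 * polity.i[i]) * wip.i[i] + b2 * polity.i[i] + u[region[i]]
  }
  for (r in 1:n.region) {
    u[r] ~ dnorm(0, tau.u)    # region random effects
  }

  # Diffuse priors
  b0 ~ dnorm(0, 0.001)
  b1 ~ dnorm(0, 0.001)
  b2 ~ dnorm(0, 0.001)
  d1 ~ dnorm(0, 0.001)
  mu.miss[1] ~ dnorm(0, 0.001)
  mu.miss[2] ~ dnorm(0, 0.001)
  prec.miss[1:2, 1:2] ~ dwish(Omega[, ], 3)   # diffuse Wishart on the precision matrix
  tau ~ dgamma(0.001, 0.001)
  tau.u ~ dgamma(0.001, 0.001)
}
"
```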
Now, one disadvantage is that I'm not including the information from the dependent variable in the imputation model for polity and women in parliament. That's true; I do miss out on a little something there. But what I gain in return is that the imputation model is not separate from the regression model. The information from each one can feed back on the other unless I specifically stop it from doing so, and so I'm estimating my model much more holistically than I would in a mice-type framework, where I impute the datasets first and then bolt the regression on as a separate step. Here the modeling of the dependent variable and the imputation of missing values form one holistic model with one posterior. It's a much more integrated process, and in some situations it can actually be more accurate than a mice-style imputation, but it's a little bit harder to program.

So I'm going to repeat my regression model, but using this Bayesian imputation process, and I'm going to plot the new results on top of my original plot. WinBUGS is going to do its thing: it's estimating the imputed values and the model parameters at the same time, transparently, and you can see it's just going about its business the way it always does, as though nothing were different from what we did before. When this is done, I'll show you what the results look like.

All right, so the red line here indicates the results of the model with imputation of both variables built into the Bayesian modeling process, whereas the solid line is the straight Bayesian regression with no imputation for the polity scores and only the standard regression imputation for the women in parliament scores that I did in Stata beforehand as part of preparing this dataset. What you can see is that the estimated lines have pretty much the same slope; the slope doesn't change a whole lot, but there has been a shift downward. We can't say from one sample that one of these lines is closer to the truth than the other, but in general we would expect the Bayesian imputation model, in repeated samples, to have a tighter distribution around the true value of the parameter than either a completely naive listwise-deletion model or the basic regression imputation we used before, which didn't account for uncertainty in the imputed values. So maybe this model is more believable, and it does tell a slightly different story. The story this model tells is that for the most autocratic countries, more women in government is associated with more corruption, while for the most democratic countries, more women in government is associated with less corruption. That's kind of interesting, and it's certainly a somewhat different story than women making no difference in autocracies. I'm not sure; I'd have to do some more investigation to figure out whether this is something I believe, although I have to say that the two stories are not terribly inconsistent, in the sense that both of them emphasize that women in government clean up corruption in democracies but not really so much in autocracies. Both models are supportive of that basic result.
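And a sketch of driving that model from R with R2WinBUGS; the data frame corr.dat, its column names, the iteration counts, and the omission of explicit initial values are all assumptions, and in practice WinBUGS may need starting values supplied for the missing-data nodes.

```r
library(R2WinBUGS)

writeLines(mi.model, "mi_model.txt")

# corr.dat is a hypothetical data frame holding the corruption dataset
bugs.data <- list(N        = nrow(corr.dat),
                  n.region = length(unique(corr.dat$region)),
                  corrupt  = corr.dat$wbgi_corrupt,
                  miss     = cbind(corr.dat$women_parl, corr.dat$polity),
                  region   = corr.dat$region,
                  Omega    = diag(2))

fit.mi <- bugs(data = bugs.data, inits = NULL,
               parameters.to.save = c("b0", "b1", "b2", "d1"),
               model.file = "mi_model.txt",
               n.chains = 3, n.iter = 20000, n.burnin = 10000)

# Posterior draws that could feed the marginal-effect plot sketched earlier
b1.draws <- fit.mi$sims.list$b1
d1.draws <- fit.mi$sims.list$d1
```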
But whether women actually increase corruption in autocracies, that would be a rather different wrinkle, and I might need to look into it a little more. Certainly I would want to use more control variables, as in the published version of the paper, before I concluded that with any certainty. But you can see this is an interesting applied example: using multiple imputation in a model can actually get you different results, and results that are different in substantively meaningful ways.

All right, so that's a pretty basic, surface-level introduction to the topic of multiple imputation. You can go much deeper into it; there are whole books written on the subject, and this is an area of active research in methodology, so many of the questions have not yet been answered. That makes it an exciting place to get involved with new methods if this is the kind of problem you're interested in. But hopefully this gives you some tools to use and some ideas to think about when you encounter missing data in your own models, which you almost certainly will, and gives you something to do beyond simply saying, well, screw it, I'm going to delete the missing observations and pretend like nothing's wrong. It's nice to have a little more to try than that very basic approach, and this should help you get started on that process. Thanks a lot, and I'll see you next time.