Auxiliary variables are very useful when dealing with missing data. They are variables that can predict missingness but are not interesting for the main model. Let's take a look at how they work.

The idea of an auxiliary variable relates to the missing data mechanism. The mechanisms we have are missing completely at random, which is not problematic, then missing at random, and then missing not at random, which is of course the most problematic. We know that missingness R can depend on the observed predictors and the observed dependent variable, but it can also depend on variables that we don't observe. We call these variables Z here. If we do observe these Z variables, we can actually do a couple of things. The Z variables are assumed to be correlated with selection; we might say, for example, that whether we observe your weight or not depends on whether you have kids or whether you are working, and that kind of thing. The Z variables are also independent of the study variables: they are either irrelevant or, based on theory, expected to be uncorrelated with the main study variables. So we can use them as auxiliary variables. How it works is that we use the information from these other variables, often demographic variables and the like, to make our missing data estimation better. One nice thing about auxiliary variables is that in some cases they can convert a problematic missing not at random pattern into a less problematic missing at random pattern. Let's see how that works.

The idea of converting a missing not at random pattern into a missing at random pattern is illustrated by an example from Enders' book. We have a model where we are interested in the correlation between self-esteem and whether a person is sexually active or not; the participants are teenagers. We don't think that sexual activity itself has a causal effect on being selected into the study, but we assume that self-esteem does. We also have theoretical reasons to believe that age influences whether a person is selected into the study, and we know that age influences sexual activity: teenagers closer to adulthood are more sexually active than younger ones. Age is not important for our theoretical model, but if we don't include it in our missing data model, we can see that there is a spurious correlation between sexual activity and being selected. If we add age to our missing data model, then, controlling for age, sexual activity and selection would be uncorrelated. That would be nice.

So how do we actually go about doing this? If you do a multiple imputation study, you simply add age as a predictor in your imputation model, run the imputation, and then leave age out when you do your main analysis. That is all there is to it.
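To make the multiple imputation recipe concrete, here is a minimal sketch in Python using scikit-learn's IterativeImputer as the imputation engine. The simulated data mirror the example above: missingness in the sexual activity variable depends only on age, and sexual activity is replaced by a continuous stand-in. The variable names and the simulated mechanism are my own illustration, not from the Enders example, and a dedicated multiple imputation package would also pool variances with Rubin's rules rather than just averaging point estimates.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(1)
n = 1000
age = rng.normal(size=n)                                  # auxiliary variable
esteem = rng.normal(size=n)
active = 0.8 * age + 0.3 * esteem + rng.normal(size=n)    # continuous stand-in
# Missingness in `active` depends only on age: MNAR if age is ignored,
# MAR once age enters the imputation model.
miss = rng.uniform(size=n) < 1 / (1 + np.exp(2.0 - 1.5 * age))
df = pd.DataFrame({"esteem": esteem,
                   "active": np.where(miss, np.nan, active),
                   "age": age})

m = 20
estimates = []
for k in range(m):
    # sample_posterior=True gives draws rather than point predictions,
    # which is what proper multiple imputation needs
    imp = IterativeImputer(sample_posterior=True, random_state=k)
    completed = pd.DataFrame(imp.fit_transform(df), columns=df.columns)
    # The analysis model leaves `age` out; it only served the imputation model.
    estimates.append(completed["esteem"].corr(completed["active"]))

print("pooled MI estimate :", np.mean(estimates))
print("observed cases only:", df["esteem"].corr(df["active"]))
```

Because missingness here depends only on age, an imputation model that includes age satisfies the missing at random assumption, even though age never appears in the analysis model.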
If you do full information maximum likelihood, then you need to do something like this. Here AV1 and AV2 are the auxiliary variables, X1 and X2 are the main variables, and Y is the dependent variable. In the SEM framework, we add the auxiliary variables and allow them to be freely correlated with all of the predictors and also with the error terms of the observed variables. This is an observed variable model, so there is just one error term. So we are saying that these auxiliary variables are freely correlated with everything. We are not interested in their correlations, and they do not affect model fit at all. But interestingly, even though these variables affect neither the path coefficients nor the model fit, simply adding them to the model allows the FIML algorithm to use their information to compensate for missingness. And that's a nice thing to have; a sketch of how this works is at the end of this section.

In more complex models, say a latent variable version of the same model with latent X1, latent X2, and latent Y, we add the auxiliary variables and let each of them correlate freely with the error terms of the observed variables. The auxiliary variables are, of course, also freely correlated with each other. The logic is that we don't constrain the correlations between the auxiliary variables and any of the observed variables to any particular values. Importantly, the error term of the latent variable should not be freed, because the latent variable is not observed and freeing that correlation would make the model unidentified. So we do the simplest possible thing that allows the auxiliary variables to be freely correlated with the observed variables: we add correlations between the error terms and the auxiliary variables.

This kind of specification is not directly supported by every statistical software; for example, I'm not sure whether you can do it in Stata. What you can always do, though, is specify additional latent variables that predict the observed values, so they are basically error terms that you specify yourself. There are workarounds if your software does not allow this kind of specification.

How many auxiliary variables should you apply? There is an influential study by Collins and colleagues from 2001 that concluded that more is always better, but the current understanding is that there are limitations. Technically, what matters is how well the auxiliary variables predict selection or missingness, and that relationship does not need to be causal. So we don't need to care about measurement error, endogeneity, reverse causality, or anything like that; we just need to calculate predictions, and for calculating predictions more predictors is better, assuming that you have a very large sample size. In practice there are limitations. If you include 100 auxiliary variables, your model might have difficulty converging. There is also an absolute limit: you cannot have more variables in the model than you have observations, because then you run into perfect collinearity. So the current recommendation is to include auxiliary variables quite liberally, but if you have problems getting the model to converge, take some of them out. In particular, if two auxiliary variables are highly correlated, that may cause convergence problems, and removing one of them may help. As general advice, using background variables that are not interesting for the main research question as auxiliary variables is pretty much always a good idea when you do a missing data analysis.
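For the FIML side, here is a from-scratch sketch of the mechanics, assuming multivariate normality: each case contributes the likelihood of whatever subset of variables it has observed, so a case with sexual activity missing still contributes information through self-esteem and age. Fitting the saturated mean and covariance of all three variables and reading off the correlation of interest is, in effect, what the saturated correlates specification achieves. In practice you would use SEM software such as lavaan or Mplus rather than a hand-rolled optimizer, and the simulated data and variable names below are again my own illustration.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 800
age = rng.normal(size=n)                                  # auxiliary variable
esteem = rng.normal(size=n)
active = 0.8 * age + 0.3 * esteem + rng.normal(size=n)    # continuous stand-in
miss = rng.uniform(size=n) < 1 / (1 + np.exp(2.0 - 1.5 * age))
active = np.where(miss, np.nan, active)                   # missingness driven by age

def unpack(theta, p):
    """Split the parameter vector into a mean vector and a covariance matrix."""
    mu = theta[:p]
    L = np.zeros((p, p))
    L[np.tril_indices(p)] = theta[p:]
    d = np.diag_indices(p)
    L[d] = np.exp(L[d])              # keep the Cholesky diagonal positive
    return mu, L @ L.T

def negloglik(theta, X):
    """FIML: each case contributes the normal density of its observed subvector."""
    p = X.shape[1]
    mu, sigma = unpack(theta, p)
    nll = 0.0
    for row in X:
        o = ~np.isnan(row)
        dev = row[o] - mu[o]
        S = sigma[np.ix_(o, o)]
        _, logdet = np.linalg.slogdet(S)
        nll += 0.5 * (o.sum() * np.log(2 * np.pi) + logdet
                      + dev @ np.linalg.solve(S, dev))
    return nll

def fiml_corr(X):
    """Fit the saturated model by FIML and return corr(column 0, column 1)."""
    p = X.shape[1]
    theta0 = np.concatenate([np.nanmean(X, axis=0), np.zeros(p * (p + 1) // 2)])
    res = minimize(negloglik, theta0, args=(X,), method="L-BFGS-B")
    _, sigma = unpack(res.x, p)
    return sigma[0, 1] / np.sqrt(sigma[0, 0] * sigma[1, 1])

X = np.column_stack([esteem, active, age])
cc = X[~np.isnan(X).any(axis=1)]                          # listwise deletion
print("FIML with auxiliary variable:", fiml_corr(X))
print("FIML without it             :", fiml_corr(X[:, :2]))
print("complete cases only         :", np.corrcoef(cc[:, 0], cc[:, 1])[0, 1])
```

With this simulated mechanism, the estimate that uses age should land near the truth, while the bivariate FIML and complete-case estimates should be biased: without age, missingness depends on the unobserved values of sexual activity, so the pattern is missing not at random.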