Sometimes you have missing data in the dependent variable, and the missingness depends on the actual missing value. For example, if you study people's wages, whether you observe a wage for a person might depend on the wage that person was offered: if the offer was low, the person may have declined the job. In this case, selection models can be useful. The most commonly used and well-known selection model is the one introduced by James Heckman in his 1976 and 1979 papers, work that won him the Nobel Prize in Economics in 2000. How does the Heckman selection model work, and what kind of assumptions does it make? Let's take a look.

Heckman's selection model, and selection models more generally, address the missing-not-at-random case. The worst-case scenario in missing data is when you have missingness in the dependent variable and whether an observation is missing depends on the value that is missing. If the missingness depends on the value that is missing, then of course we can't test what the missing data mechanism is; it must be assumed based on theory. For example, if y here is the wage that a person was offered and we only observe those people who decided to accept the job, we could infer based on theory that a person who is offered a very low wage would rather stay at home, for example to take care of the kids or to live on unemployment benefits, or for some other reason. This is the scenario that the Heckman selection model addresses.

When Heckman's selection model is used, a couple of terms appear in papers. This is not a good example of how the Heckman selection model is applied, but it shows you the key terms: you see citations to the 1979 paper, you see the word Heckman, and you see the term inverse Mills ratio. So who is Mills, what is the ratio, and why do we take an inverse?

In a selection model you estimate two equations. You have the selection equation and the main model; this example comes from Enders' book. In Enders' example, IQ explains job performance, and well-being predicts whether the person stays in the job or not. So we have one equation that tries to explain whether we observe the value of the dependent variable for a person, and another equation that tries to explain differences in the dependent variable, controlling for the fact that some of those values were not observed. For this to work, we must assume that the error terms of these two models are normally distributed and that they are correlated: whatever unobserved factors cause job performance also cause the missingness, so the error term of job performance correlates with the selection. The selection equation is a probit model, and the idea is that the job performance data are observed if the predicted value plus the error term zeta is greater than zero.
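To make that setup concrete, here is the two-equation model in notation. This is a minimal sketch using generic symbols (y for the outcome, s for the selection indicator, x and z for the predictors), not the exact notation of Enders' book or Heckman's papers:

```latex
% Main equation: y_i is observed only for cases with s_i = 1
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i

% Selection equation: a probit with latent index s_i^*
s_i^* = \gamma_0 + \gamma_1 z_i + \zeta_i, \qquad
s_i = \begin{cases} 1 & \text{if } s_i^* > 0,\\ 0 & \text{otherwise.} \end{cases}

% The two error terms are assumed jointly normal with correlation \rho;
% a nonzero \rho is what makes the missingness non-ignorable.
\begin{pmatrix} \varepsilon_i \\ \zeta_i \end{pmatrix}
\sim N\!\left(\begin{pmatrix}0\\0\end{pmatrix},
\begin{pmatrix}\sigma^2 & \rho\sigma\\ \rho\sigma & 1\end{pmatrix}\right)
```

In the Enders example, x is IQ, y is job performance, and z is well-being.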
At this point it's a good idea to recap what the probit model is about. In a probit model we first estimate a linear prediction; we call it y star, and this part is like a normal regression model. We then add a normally distributed error term to the linear prediction, like we do in regression analysis, and instead of observing y star directly, we observe a one if y star is greater than zero and a zero otherwise. Here a one means that the person was observed, and a zero indicates that there is missing data for that person. That is the probit model in brief; if you don't know the probit model, take a look at another video where I explain it in detail. Why we use the probit model will become clear pretty soon.

Let's get back to the selection model and use an example. This example comes from the Stata user manual, where it shows how to implement a selection model using Stata's generalized structural equation modeling. We won't be using structural equation modeling here, but it is a nice path diagram and a nice data set that shows what we are going to do. We are explaining the wages of people, women in this case, and how the wages depend on education and age. The problem is that if a person was offered a low wage, they might choose not to work, and this is the problem Heckman addressed when he developed the selection model. We also have a selection equation, and we think that whether a person decides to work depends on whether they are married and whether they have children. We could, for example, theorize that a person who has many children needs more money and is therefore more likely to work. We could also theorize that a person who has many children would favor staying at home with the kids instead of going to work, depending on what kind of benefits the government provides for having kids in the society where you live. Either way, we think that whether you are married and whether you have kids affect whether you go to work, and we also think that marriage or the number of kids will not affect how much money you are offered for a job. It would be illegal for that to matter, at least in Finland where I live.

So we run Heckman's selection model using these data, which are artificially generated, and we get two sets of estimates: the wage equation, which is the equation of interest, and the selection equation, which is simply a probit model. We can see that if you are married or have kids, you are more likely to work; if you are older, you are more likely to work; and if you have more education, you are more likely to work. The wage equation then controls for the selection effect.

So how does this actually work? Let's take a look. Here is the selection model again, from Enders' book. We have job performance as the dependent variable. We think well-being explains whether you actually stay in the job or not, that well-being does not affect job performance, and that IQ does not affect selection. Normally we would also regress selection on IQ, because that's how this is usually done; I will get to that a bit later, but let's first look at the basic idea. The idea is that we try to control for zeta: if we could control for zeta somehow, then epsilon would be uncorrelated with the selection. That's the key insight here. Selection is kind of like an omitted variable problem.
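To see that omitted-variable point numerically, here is a small simulation sketch in Python. It is not the Stata example or Enders' data; the numbers (an error correlation of 0.7 and so on) are just illustrative assumptions:

```python
import numpy as np

# Numeric illustration of the omitted-variable point: when the outcome error
# (epsilon) is correlated with the selection error (zeta), epsilon no longer
# averages to zero among the cases we actually get to observe.
rng = np.random.default_rng(1)
n = 1_000_000
rho = 0.7                                          # assumed error correlation

zeta = rng.normal(size=n)                          # selection-equation error
eps = rho * zeta + np.sqrt(1 - rho**2) * rng.normal(size=n)  # outcome error

xb = rng.normal(loc=-0.5, scale=1.0, size=n)       # spread of linear predictions
selected = xb + zeta > 0                           # probit-style selection rule

print("mean of eps, all cases     :", round(eps.mean(), 3))            # about 0
print("mean of eps, observed cases:", round(eps[selected].mean(), 3))  # clearly positive
```

Among the observed cases the outcome error no longer averages to zero, and it varies systematically with the selection predictors; that is exactly the omitted variable we need a stand-in for.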
If we can come up with the right control variable, then epsilon, after controlling for that variable, will be uncorrelated with the selection. The key problem is that we need to estimate zeta. How do we go about doing that? This is where the probit model and the normal distribution assumption become important.

The idea of the probit model was that we observe a one if y star is greater than zero, and we observe a zero, meaning the dependent variable is missing, if y star is less than zero. How does this help? We have the linear prediction, our betas times the observed predictors in the selection model; we call it xb. If the linear prediction is a, then we know that zeta must have a value of at least minus a for the case to be observed. For example, if the fitted value of the linear prediction is minus one, we know that zeta must be at least one for that case to be observed, and the area on the right-hand side of the normal curve gives the probability of actually observing the case.

Also, if we know that zeta is at least minus a, we can calculate the expected value of zeta. Say we know that zeta must be at least one: in the figure the normal distribution is centered at the linear prediction of minus one, and zero, the selection cutoff, is one standard deviation above that mean. Given that zeta is at least one, what is its expected value? We basically know that if we observe a case, then zeta did not receive any of the values below one; it must have gotten a value above one. Assuming the value of zeta is above one, what is its expected value? We would have to integrate over that part of the normal distribution, and that would give us the mean. We don't actually have to do the integration, because the expected value, the mean of this truncated zeta, is given by a simple equation. What it quantifies is the ratio of the probability density, the height of the normal curve at the cutoff point, divided by the area under the curve to the right of the cutoff. Here a is the predicted value, and one minus the cumulative distribution evaluated at the cutoff gives that area, because the cumulative distribution gives the area on the left-hand side of the cutoff, so one minus the cumulative distribution is the right-hand side.

There are different ways of calculating or expressing this quantity, which is known as the inverse Mills ratio. Why is it called the inverse Mills ratio? There was a statistician called Mills who worked on the ratio of the cumulative probability to the probability density; that is the Mills ratio, and this is simply the reciprocal of the same ratio. Mills divided the cumulative probability by the probability density, and we divide the probability density by the cumulative probability; there is not much more to it. If you want to know where this comes from, you need to study statistics, but I don't think that understanding the derivation is very useful for an applied researcher. It's a bit unfortunate that the name inverse Mills ratio is used in applied papers; I would think that, for example, "expected zeta" would be a lot more descriptive of what the inverse Mills ratio actually quantifies.
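As a sanity check on that formula, here is a short Python snippet comparing the inverse Mills ratio to a brute-force truncated mean. This is only an illustration of the quantity, not part of any estimation routine:

```python
import numpy as np
from scipy.stats import norm

def inverse_mills(a):
    # E[zeta | zeta > -a] for standard-normal zeta: phi(a) / Phi(a),
    # equivalently phi(-a) / (1 - Phi(-a)).
    return norm.pdf(a) / norm.cdf(a)

a = -1.0                                  # linear prediction of -1: zeta must exceed +1
rng = np.random.default_rng(0)
zeta = rng.normal(size=2_000_000)
truncated_mean = zeta[zeta > -a].mean()   # brute-force expected value of the kept zetas

print("formula   :", round(inverse_mills(a), 3))  # about 1.525
print("simulation:", round(truncated_mean, 3))    # essentially the same value
```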
Let's take a look at how this works in a bit more detail, to understand what the Heckman selection model is doing. Here is the probit model using the same data set: we regress selection on married, children, education, and age, and these are the probit coefficients. They are, of course, the same as in the selection model, because we are estimating the same probit model. We then take the predicted values, the fitted values; they are called xb, where x is the observed data and b is the regression coefficients, so the observed predictors times the regression coefficients give the predicted values, and here is the predicted value for each case. For example, for the first case the predicted value is minus 0.69 and the case was not observed, and so on.

Then we have min zeta, which I generated; it is the smallest value that zeta must take for the case to be observed. I calculated it only for the selected, observed cases. We could of course calculate it for all cases, but I chose to calculate it only for the cases that we observe, because those are the ones used in the main equation. If the predicted value is minus 0.2, then zeta must be at least 0.2 for that case to be selected. Zeta is normally distributed with a mean of 0 and a standard deviation of 1, so if we require a positive value of zeta, we know that the probability of being selected is less than 50%. The inverse Mills ratio then tells us the expected value of zeta, given that we know zeta must be at least 0.2. That's the idea.

Let's look at the inverse Mills ratio a bit more. Here is a hypothetical zeta; this is the distribution of zeta, and the gray area covers the values that would cause the case to be selected. If the predicted value is very large, the case will be selected no matter what: if the linear prediction is, say, 5, then pretty much any value of zeta added to 5, because zeta is standard normal, will lead to a value greater than 0 and therefore to the case being selected. The red line gives the average, the expected value, of zeta given that zeta is greater than whatever cutoff we have. So we can see how the inverse Mills ratio, this expected value, depends on the minimum required zeta. The animation shows that when we start increasing the required zeta, the mean of the remaining zeta moves to the right; it increases. This demonstrates what the inverse Mills ratio quantifies: the expected value of those values of zeta that would cause the case to be selected. As the required zeta increases, so does the inverse Mills ratio.

In fact, if we look at the relationship between the linear prediction and the inverse Mills ratio, we can see that it is a decreasing function. If the linear prediction is very large, say 4, then pretty much any value from the normal distribution qualifies, the expected value of the full normal distribution is 0, and the inverse Mills ratio is essentially 0 as well. When the predicted value is highly negative, for example minus 4, then we need a very large value of zeta; only the right-hand tail of the zeta distribution qualifies and produces a y star that is greater than 0.
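The per-case bookkeeping described above (xb, min zeta, and the inverse Mills ratio for the observed cases) can be reproduced with a few lines of Python. The data below are simulated and the variable names only loosely echo the Stata example, so treat this as a sketch of the bookkeeping rather than a replication:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.stats import norm

# Simulated, illustrative data; coefficients and names are made up.
rng = np.random.default_rng(2)
n = 500
df = pd.DataFrame({
    "married":  rng.binomial(1, 0.6, n),
    "children": rng.poisson(1.5, n),
    "educ":     rng.normal(12, 2, n),
    "age":      rng.integers(20, 60, n),
})
index = (-3 + 0.5 * df["married"] + 0.4 * df["children"]
         + 0.1 * df["educ"] + 0.02 * df["age"] + rng.normal(size=n))
df["selected"] = (index > 0).astype(int)

# Probit of selection on its predictors (the selection equation).
Z = sm.add_constant(df[["married", "children", "educ", "age"]])
probit = sm.Probit(df["selected"], Z).fit(disp=0)

# Per-case quantities described above.
df["xb"] = Z @ probit.params                          # linear prediction a
df["min_zeta"] = -df["xb"]                            # zeta must exceed -a to be observed
df["imr"] = norm.pdf(df["xb"]) / norm.cdf(df["xb"])   # expected zeta given selection
print(df.loc[df["selected"] == 1, ["xb", "min_zeta", "imr"]].head())
```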
So the relationship between the linear prediction and the inverse Mills ratio from the probit model is shown here, and we can observe a couple of things. The linear prediction xb is, of course, a linear function of the observed variables. We can also see that, for the most part, the inverse Mills ratio is approximately a linear function of the prediction. If we take, for example, the range from minus 2 to 2, a straight line would be a pretty good approximation, and that is where most of the cases fall, between plus and minus 2 standard deviations from the mean. So the inverse Mills ratio is close to a linear function of a weighted sum of the independent variables, and this causes a bit of a problem. If we include an inverse Mills ratio calculated from a probit model that uses the same set of predictors as the main model, it might work, but it would produce an extremely collinear model. We would need a large number of observations, and the identification of the model would depend on the fact that the function curves a little bit; justifying that kind of functional form based on theory would be very difficult.

In practice, we need instrumental variables to avoid this collinearity issue. The collinearity issue is avoided in Enders' example by having well-being be a predictor of selection but not a predictor of job performance. So in practice you need to have these instruments in the Heckman selection model, and the reason is that otherwise the inverse Mills ratios would be perfectly or nearly perfectly correlated with the predictors in the main equation. If we had a very large sample size, we could in theory use only IQ in the selection model and get this to run, but we would then rely on the difference in functional form, linear in the main equation and probit in the selection equation, for identification. That is questionable, because the probit model is roughly linear for a large share of the observations, so we would be relying on the tails of the distribution for identification, and that's challenging. In the Stata example we have two instruments: we think that whether you're married and whether you have kids affect whether you go to work, which is reasonable, and they are not supposed to affect the wage you are offered. In fact, in many cases, if you apply for a job you wouldn't tell the employer whether you're married or how many kids you have, so there is no way that the wage the employer offers could depend on marriage and children, because the employer doesn't know these things. These variables work as instruments because they are unrelated to the wage itself.

So are these models useful? They are certainly widely used, but there are some big caveats. The Heckman selection model is highly sensitive to the assumption that the selection follows the probit model. You saw that when we calculate the inverse Mills ratio, we use the normal density and the cumulative normal distribution. If those distributions don't actually characterize the selection process, then the inverse Mills ratios, the expected values of zeta, will be incorrect and the results can be wildly misleading.
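To tie the pieces together, here is a sketch of the classic two-step version of this logic in Python, with one exclusion restriction: married appears in the selection equation but not in the wage equation. The data are simulated under exactly the normality assumptions just discussed, which is why the correction works so cleanly here; the variable names and coefficients are illustrative assumptions:

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(3)
n = 20_000

# Simulated world: true education effect on wage is 0.5, and the wage error is
# correlated (rho = 0.6) with the selection error, so wages are missing not at random.
educ = rng.normal(12, 2, n)
married = rng.binomial(1, 0.6, n)
rho = 0.6
zeta = rng.normal(size=n)                                    # selection error
eps = rho * zeta + np.sqrt(1 - rho**2) * rng.normal(size=n)  # correlated wage error
wage = 1.0 + 0.5 * educ + eps
selected = (-4.0 + 0.3 * educ + 1.0 * married + zeta) > 0    # probit selection rule

# Step 1: probit of selection on education and the instrument, then the IMR.
Z = sm.add_constant(np.column_stack([educ, married]))
probit = sm.Probit(selected.astype(int), Z).fit(disp=0)
xb = Z @ np.asarray(probit.params)
imr = norm.pdf(xb) / norm.cdf(xb)

# Step 2: wage equation on the selected cases, with and without the IMR control.
y = wage[selected]
X_naive = sm.add_constant(educ[selected])
X_heck = sm.add_constant(np.column_stack([educ[selected], imr[selected]]))
print("complete-case OLS slope:", sm.OLS(y, X_naive).fit().params[1])  # attenuated
print("two-step slope         :", sm.OLS(y, X_heck).fit().params[1])   # close to 0.5
```

In practice the second-step standard errors also need a correction, which full implementations such as Stata's heckman command handle; the point of the sketch is only to show where the inverse Mills ratio enters the main equation.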
There is research showing that the Heckman model's assumptions are very important: if the assumptions don't hold, it may be better to skip the selection equation and estimate the main equation with other missing data techniques that assume missing at random, instead of using a selection model. Nevertheless, these models are very commonly used, people generally don't justify the assumptions, and justifying them based on theory would be very difficult, because it is difficult to justify the functional form of things that we don't observe. How do we justify that there is a strict cutoff for a y star that we don't observe? How do we justify that zeta is normally distributed? We pretty much can't.

Because of that, some methodological sources say that these models should be used more as a sensitivity analysis. We could estimate a normal model with missing-at-random techniques such as multiple imputation or maximum likelihood for missing data, then estimate a selection model, compare the results, and say that the first set of results shows the estimated effects assuming the missing data mechanism is MAR, and the second set assuming there is a normally distributed selection process. We of course don't know whether either of these assumptions is true, but that gives us some kind of ballpark for where the effects might be. So while the Heckman selection model is conceptually very useful, there are empirical challenges that make it less useful than it seems, and that is one of the reasons why, for example, Enders' book takes a rather dim view of selection models.