Missing data is very common in empirical research. In fact, almost every project that I have done has had some missing data. Whether missing data is a problem depends on a few things, and fortunately there are tools that we can apply to deal with it. A related concept is sample selection. In this video I will give a quick overview of what missing data is and what sample selection is.

Let's start with an example. This is a dataset that I like to use when I teach regression analysis. We have a bit more than 100 observations, and the units that we observe are occupations from the census of Canada in the 1970s. Our task is to explain income with the other variables: education, the share of women, and a prestige score for the occupation. The data are complete, so we have every value for every variable for every case. There are no holes, no missing data, so this is the easy case.

But what if, for example, we don't observe the prestige score for the government administrators? What do we do about it, and what is the consequence? If we use these data to run a regression analysis, then by default our statistical software, regardless of which software we use, will simply drop the government administrators from the data and run the regression model on the remaining cases. Is that a big problem? We still have a bit more than 100 observations, so maybe dropping one case is okay. What if that is not the only missing value? What if the share of women is also missing for the general managers? Two observations out of a bit more than 100 are probably not a big deal either; we can just run the regression.

But what if there is more missingness, say a missing value for one variable in every case? If we run the regression with these data, the default regression command will just report that there are no data to analyze, because the default action is to drop every case that has any missing data. So are these data useless for estimating the effects of education, share of women, and prestige on income? The answer is no. There is a lot of information left: with four variables and only one fourth of the values missing, we still have 75% of the available information. It just happens that the missingness is spread across all the observations. We can estimate the effects of education, women, and prestige on income, but to do so we need to apply modern missing data techniques. Simply eliminating observations will not cut it.

This is a very common scenario: researchers are often faced with a limited number of missing cases or missing values. For example, in Yli-Renko's article, which I used as an example when discussing measurement, the authors report 180 final cases after dropping 15 observations, so they dropped about 8% of their data. Is it okay to throw 8% of your data away without analyzing the impact of doing so? Probably not always. Sometimes throwing 8% of the data away might be okay; in other cases it might not be, depending on why the data are missing, what the missing data pattern is, and what mechanism created the missingness. This is something that the paper does not address. It is quite common that, as long as less than 10% of the data are missing, people just drop the incomplete cases and are done with it, without any explanation.
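To make the default behavior concrete, here is a minimal sketch in Python with simulated data. The variable names mirror the example, the numbers are invented, and statsmodels stands in for whatever software you use, since essentially all packages default to the same listwise deletion.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy stand-in for the occupation data; all names and numbers are
# illustrative. We explain income with education, share of women,
# and a prestige score.
rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "education": rng.normal(10, 2, n),
    "women": rng.uniform(0, 100, n),
    "prestige": rng.normal(47, 17, n),
})
df["income"] = 500 * df["education"] + 50 * df["prestige"] + rng.normal(0, 800, n)

# One hole: the formula interface defaults to listwise deletion,
# so the regression silently runs on n - 1 cases.
df.loc[0, "prestige"] = np.nan
fit = smf.ols("income ~ education + women + prestige", data=df).fit()
print(int(fit.nobs))  # 99

# Now make one value missing in every case, cycling through the
# four variables: listwise deletion leaves nothing, even though
# 75% of the individual values are still observed.
for i, col in enumerate(df.columns):
    df.loc[df.index % 4 == i, col] = np.nan
print(len(df.dropna()))  # 0
```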
When I review a paper, if it drops more than 5% of observations, I will ask the authors to justify the assumptions that are required for unbiased estimation when those observations are dropped. In practice, I will ask about the missing data mechanism; I will talk about the mechanisms in a different video.

So what is missing data and what is selection? Missing data refers to the typical scenario where we have some cases, and some variables do not have values for some of those cases. There are a couple of techniques that people commonly apply that are bad. The first is deleting cases: unless the missingness is random and there is not a lot of it, this is not a good idea. Even worse is something called mean substitution: if a variable has a missing value for a case, you take the average over all the other cases and use that average in place of the missing value. This is a common strategy and it is always a bad idea. Dropping cases is a much better idea than inventing data based on mean substitution.

There are better techniques for dealing with missing data. You can use maximum likelihood estimation, which takes care of the missing data problem by fitting the model to the data that you have and ignoring the missingness. This is called full information maximum likelihood, FIML for short, because it also uses the information from the cases that are only partially observed. Then we have multiple imputation, which comes up with guesses for the missing values, in the same spirit as mean substitution, but in a much more refined way that has been proven to work well. So there are ways of filling in the data that do not cause bias; in fact, multiple imputation corrects for certain kinds of bias caused by missing data.

Then we have the related issue of sample selection, which can mean one of two things. The first is that some cases are observed and others are not, depending on the phenomenon that you study. A classic example is Heckman's study of wage offers for women: whether a woman receives a high wage offer determines whether she goes to work, so if the job market gives her a low wage offer, she chooses to stay home. This is a study from the 1970s, so things might be a bit different now. The problem is that you do not actually observe the wages that the women were offered; you only observe the wages of those who went to work. There is systematic selection because whether a case is observed depends on the value of the dependent variable, and this is a problem. You can deal with it using a selection model.

The second meaning is self-selection, for example selection into a treatment. A classic example is a medical trial. If you allow people to select themselves into the treatment or the control condition, the sicker people go to the treatment group and the less sick people go to the control group. If you then compare the treatment and control groups after the medication, the difference between them is not only due to the medication's effect but also due to who chose which group. That is a difference in levels.
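Before moving on, here is a hedged sketch of multiple imputation, one of the better techniques mentioned above, using the MICE implementation in statsmodels. It rebuilds the toy data from the earlier sketch; the formula and settings are illustrative, not a recommendation.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.imputation import mice

# Rebuild the toy data with one missing value per case, as in the
# earlier sketch.
rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "education": rng.normal(10, 2, n),
    "women": rng.uniform(0, 100, n),
    "prestige": rng.normal(47, 17, n),
})
df["income"] = 500 * df["education"] + 50 * df["prestige"] + rng.normal(0, 800, n)
for i, col in enumerate(df.columns):
    df.loc[df.index % 4 == i, col] = np.nan

# Multiple imputation by chained equations: each hole is filled in
# repeatedly from a predictive model, the regression is fit on every
# completed dataset, and the results are pooled (Rubin's rules).
imp = mice.MICEData(df)
mi = mice.MICE("income ~ education + women + prestige", sm.OLS, imp)
results = mi.fit(n_burnin=10, n_imputations=20)
print(results.summary())
```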
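And here is a minimal sketch of a selection model for the wage-offer problem, Heckman's classic two-step estimator, on simulated data. All coefficients are invented, and the "kids" variable is an assumed exclusion restriction, something that affects whether a woman works but not the wage offer itself.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

# Simulated wage-offer data in the spirit of the Heckman example.
rng = np.random.default_rng(1)
n = 5_000
educ = rng.normal(12, 2, n)
kids = rng.integers(0, 3, n).astype(float)
u = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], n)
offer = 1.0 + 0.5 * educ + u[:, 0]                  # latent wage offer
work = (0.3 * educ - 0.8 * kids - 3.0 + u[:, 1]) > 0
wage = np.where(work, offer, np.nan)                # observed only if working

# Step 1: probit for the selection equation, then the inverse Mills
# ratio from the fitted linear index.
Z = sm.add_constant(np.column_stack([educ, kids]))
probit = sm.Probit(work.astype(float), Z).fit(disp=0)
xb = Z @ probit.params
imr = norm.pdf(xb) / norm.cdf(xb)

# Step 2: OLS on the selected sample with the Mills ratio added as a
# regressor to absorb the selection term.
X = sm.add_constant(np.column_stack([educ[work], imr[work]]))
ols = sm.OLS(wage[work], X).fit()
print(ols.params)  # roughly [1.0, 0.5, rho * sigma] = [1, 0.5, 0.5]
```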
There is also another kind of sample selection, which is trickier to deal with: the effect itself varies between units, and units select based on the effect they expect. For example, in management, one of the key principles that we believe in is that managers make informed decisions. If we study which companies invest in factories and what the effect of investing in a factory is, only those companies that think they will benefit the most from the investment will invest. So the effect is not the same for everyone, and managers make decisions based on the expected effect; if the managers are any good, they can actually estimate what that expected effect is. This kind of selection on the expected effect causes issues that we need to deal with, and for this case we apply the same kind of selection models that we apply to the missing data problem. In this way, missing data and sample selection are related, because the same tools and principles apply even though they might seem like unrelated issues.

I will explain these concepts in more detail in other videos, but for now you just need to understand the basics. Missing data means that some observations that we would like to have are missing. Sample selection can mean a systematic pattern of missingness: either a full case is missing, or the dependent variable is missing, or one of the values of the explanatory variables is chosen based on the expected gains from that variable for that case. There is a large and developing literature on missing data analysis. Unfortunately for applied researchers, this literature is technical, but once you understand some of the basic principles, these techniques are not that complex to apply.
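A small simulation makes the factory example concrete. Firms whose expected benefit is highest select into the investment, so a naive comparison of investors and non-investors overstates the average effect. All numbers below are invented.

```python
import numpy as np

# Each firm has its own return to building a factory; managers see a
# noisy signal of that return and invest only when it looks positive.
rng = np.random.default_rng(2)
n = 10_000
effect = rng.normal(5, 10, n)            # true firm-specific effect
signal = effect + rng.normal(0, 3, n)    # manager's informed estimate
invest = signal > 0
baseline = rng.normal(100, 15, n)
outcome = baseline + effect * invest

naive = outcome[invest].mean() - outcome[~invest].mean()
print(f"average effect over all firms: {effect.mean():.1f}")   # ~5
print(f"naive investor vs non-investor gap: {naive:.1f}")      # much larger
# The naive gap mixes the true effect with selection: the investors
# are precisely the firms with the largest expected effects.
```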