Before we decide that we will apply a missing data technique in the first place, we still need to understand how much missing data there actually is, because there can be so little missing data that nothing needs to be done. If the amount of missingness is small, then missing data techniques are often not necessary. So how much missing data is too much? That depends on a couple of different things. The first is the amount of missing data: if only one case out of 1000 has a missing value, then the missingness is hardly a problem and the technique you use makes little difference. So the answer depends on both the amount of missing data and the missingness mechanism. By understanding both, we can decide how to proceed.

Let's look at what Wooldridge says about this topic. Wooldridge says that if the data are missing at random, then things are not that bad. That is a bit misleading, because in missing data terminology, what Wooldridge is describing is missing completely at random. So if the missingness is a purely random process, what happens? Basically, cases are dropped at random: whether a case is dropped has nothing to do with its values, so the remaining cases are just a random sample of the original data. The consequence of data being missing completely at random is simply that the sample is smaller, and if the sample is still reasonably large, that is not a serious problem. Wooldridge's argument is essentially that when the amount of missing data is small, the loss of efficiency is small and can often be ignored. That is true, but only partially true. If you have a lot of missingness, then the efficiency loss under missing completely at random can be nontrivial. For example, if you end up dropping half of your data because of missingness, you probably should be using modern missing data techniques.

Wooldridge also says that these techniques are difficult to apply, and that too is only partially true. Multiple imputation can get very complicated; it is one kind of modern missing data technique. But maximum likelihood estimation with missing data is really simple to use: it is usually just a matter of specifying an option in the statistical software, and the software then applies the technique for you. In fact, some structural equation modeling software defaults to this kind of missing data analysis, because there are really not many downsides to it.

So what are the consequences then? Wooldridge further explains that the consequences depend on what the missingness depends on. He distinguishes two scenarios, exogenous sample selection and endogenous sample selection, which correspond roughly to the missing at random and missing not at random mechanisms. The difference is whether a missing value in a variable depends on the value of that variable itself. But the scenarios that Wooldridge addresses are about listwise deletion: if there is missingness in any of the variables, then that case will be dropped, meaning all values of that case will be dropped. So there is really no difference between missing not at random and missing at random if you apply listwise deletion.
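As a concrete illustration of that first point, here is a small simulation sketch of my own (not from the video): half of the cases are dropped completely at random and the model is refit with listwise deletion. The true intercept of 0 and slope of 1 are assumptions chosen just for the example.

```python
# Illustrative sketch: listwise deletion under missing completely at random (MCAR).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 1000
x = rng.normal(size=n)
y = 0.0 + 1.0 * x + rng.normal(size=n)      # assumed true intercept 0, slope 1

full = sm.OLS(y, sm.add_constant(x)).fit()  # complete-data benchmark

# MCAR: each case is dropped by a coin flip, independently of x and y.
keep = rng.random(n) > 0.5
mcar = sm.OLS(y[keep], sm.add_constant(x[keep])).fit()

print("full data      :", full.params, full.bse)
print("MCAR + listwise:", mcar.params, mcar.bse)
# The point estimates stay close to the full-data ones (consistency); only the
# standard errors grow because fewer cases remain (loss of efficiency).
```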
If you are using modern missing data techniques, which use the information from the variables that you have observed to estimate the variables that you have not, then whether the data are missing not at random or missing at random makes a difference. Interestingly, if you have X variables or exogenous variables, variables that you observe and assume to be fixed, then missingness on those variables does not have much of an effect. If your missingness depends on the dependent variable, then that is a lot more problematic. This is a general rule: missingness in a dependent variable is more problematic than missingness in an independent variable. If you have a latent variable model with multiple indicators for exogenous latent variables, then the missingness will be in those indicators. Because the indicators depend on the latent variable, they are considered Y variables in this kind of analysis. So if you have a latent variable model, then missingness in any of the variables that you observe is generally problematic, unless the data are missing completely at random.

Let's take a look at an example of how this works. This is our full data: we have 1000 observations of X and Y, where Y is the dependent variable and X is the independent variable. We have two regression lines here. The red line is the population regression line, where the intercept is 0 and the slope is 1, and the blue line is the estimated regression line from this sample of 1000. The lines overlap almost perfectly, because regression is pretty precise with such a large sample size.

What will happen when some of these data are missing? Suppose the data are missing on X. Here it doesn't really matter whether this is missing not at random or missing at random, because we are dropping both X and Y for those cases where there is missingness. The distinction would matter if the data were missing on Y but the missingness depended on X: then we would have a missing at random problem rather than a missing not at random problem, because we could use the X values to predict the missing Ys. So what is the consequence here? We drop all cases where X is less than 0, so all of that is missing. There is systematic missingness with respect to X, and we can see that the consequences are not that severe. The estimated regression line, the blue line here, is still approximately correct. What we lose is precision, so the red line and blue line are not as well aligned as before, but with 500 observations remaining, that probably wouldn't be a big problem. So if the missingness depends on the X values, dropping those cases causes inefficiency, but it does not cause inconsistency or bias in regression analysis.

If the missingness depends on the Y values, then things are a lot more severe. We can see that if we apply the same rule and drop every case where Y is below zero, the regression line is going to be biased: the slope is too small and the intercept is too high, so the results will be inconsistent. Why is this the case? You can think of it basically as missingness that depends on an unobserved quantity. The Y values are the sum of the fitted part, calculated from X, and an unobserved part, the error term. Because the missingness depends on an unobserved variable, the error term, we have a problem. That's one way to understand it.
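A minimal sketch of this example (my own reconstruction, assuming a true intercept of 0 and slope of 1 as in the plot): drop cases based on X, then based on Y, and compare the listwise-deletion estimates.

```python
# Illustrative sketch: selection on X versus selection on Y under listwise deletion.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 1000
x = rng.normal(size=n)
y = 0.0 + 1.0 * x + rng.normal(size=n)      # population line: intercept 0, slope 1

def ols_params(keep):
    """Fit OLS using only the retained cases (listwise deletion)."""
    return sm.OLS(y[keep], sm.add_constant(x[keep])).fit().params

print("full data       :", ols_params(np.ones(n, dtype=bool)))
print("drop when X < 0 :", ols_params(x >= 0))  # still about (0, 1), just less precise
print("drop when Y < 0 :", ols_params(y >= 0))  # intercept too high, slope too small
```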
But if you simply look at this graph, you can see that the observations with a large negative value in the error term are more likely to be dropped than the observations with a large positive value in the error term. This causes the error term to be correlated with X, and that causes an endogeneity problem. So missingness on Y is more severe than missingness on X in regression analysis.

Now, what are the consequences for different techniques under different mechanisms? I will be talking about four different techniques. The first is listwise deletion, which basically means dropping a case if there is even a single missing value in any of the variables that we have. That is the default in many statistical software packages: regression analysis only uses complete cases and discards the rest. Then we have mean imputation. Mean imputation is a technique where you take the mean of a variable and substitute that mean for all missing values of that variable. It is sometimes recommended as a simple, useful technique for dealing with missing data in some entry-level quantitative data analysis books, but it is generally a really bad technique. Then we have modern missing data techniques, which include multiple imputation and maximum likelihood estimation with missing data. And then we have selection models, particularly the Heckman selection model, where we model the cause of the missingness in one equation, and then we have the equation that we want to estimate, and we use information from the missingness equation to correct the equation of interest. I have other videos on all of these techniques.

So what are the consequences? If data are missing completely at random, then the data are basically a smaller sample of what we would have if we had observed all the variables. In that case, listwise deletion is consistent but inefficient, because we could get more precise estimates using modern missing data techniques. Mean imputation is always inconsistent and inefficient: we get results that are systematically incorrect and less precise than what we would get using modern missing data techniques. When the data are missing at random, so the missingness depends on some of the variables but not on the missing values themselves, then listwise deletion is going to be inconsistent and inefficient. Fortunately, modern missing data techniques are still consistent under this scenario, and they are also more efficient than listwise deletion. Having the full data would of course be the most efficient way of doing things, but we don't have the full data, so the modern missing data techniques are the best we have available.

Missing not at random means that the missingness depends on the missing value itself, or, if we have missingness in the Y variable, that the missingness depends on the values of Y. That is the most problematic scenario, and listwise deletion, mean imputation, and modern missing data techniques are all inconsistent under it. A selection model can be consistent, but it requires some strong assumptions about things that we cannot observe. So selection models can be useful when data are missing not at random, but they basically trade one set of assumptions, the missing at random assumption, for another set of assumptions that depends on what kind of selection model you are applying.
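To make the contrast concrete, here is a sketch of my own (not from the video) showing why mean imputation fails when Y values are missing at random, with the missingness driven by X. The cutoff of 0.5 and the one-shot regression imputation are illustrative assumptions; a proper modern technique such as multiple imputation would also add imputation noise and pool results across several imputed data sets.

```python
# Illustrative sketch: mean imputation versus a crude regression imputation
# when Y is missing at random (missingness depends on the observed X).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 1000
x = rng.normal(size=n)
y = 0.0 + 1.0 * x + rng.normal(size=n)             # assumed true intercept 0, slope 1

observed = x < 0.5                                  # Y is missing whenever X >= 0.5
y_obs = np.where(observed, y, np.nan)

# Mean imputation: every missing Y gets the mean of the observed Ys.
y_mean_imp = np.where(observed, y_obs, np.nanmean(y_obs))
mean_imp = sm.OLS(y_mean_imp, sm.add_constant(x)).fit()

# Crude regression imputation: predict the missing Ys from X using complete cases.
cc = sm.OLS(y_obs[observed], sm.add_constant(x[observed])).fit()
y_reg_imp = np.where(observed, y_obs, cc.predict(sm.add_constant(x)))
reg_imp = sm.OLS(y_reg_imp, sm.add_constant(x)).fit()

print("mean imputation      :", mean_imp.params)   # slope pulled toward zero
print("regression imputation:", reg_imp.params)    # close to intercept 0, slope 1
```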
So these selection models are by no means silver bullets, and applying selection models when their assumptions don't hold can actually be worse than using the modern missing data techniques that assume the data are missing at random rather than missing not at random. So the consequences basically depend on the pattern of missingness, on the mechanism, and on which technique you apply. Generally, the safest thing to do is to apply these modern missing data techniques, and doing so is not always as complicated as Wooldridge implies.