Identification is an important property of a statistical model. Once we go beyond simple techniques like linear regression analysis, identification becomes a concern. What is identification? The term is used in two different contexts in the research methods literature.

First, there is mathematical identification, sometimes a bit incorrectly referred to as statistical identification, which refers to whether it is possible to arrive at unique estimates for the model parameters given our data. Here are two examples of non-identified models. Let's say that y equals beta 0 plus beta 1 x1 plus beta 2 x1 plus u. We can calculate that the sum of beta 1 and beta 2 must equal the covariance of x1 and y divided by the variance of x1, but we cannot know which of beta 1 or beta 2 is larger, whether one of them is perhaps zero, or whether one is negative and the other positive. That is impossible to determine using this model. Another example: say the variance of y is produced by two sources of variance, theta 1 and theta 2, that we do not observe. Because they are unobserved, we cannot know which of theta 1 or theta 2 actually drives the variation, or whether both do. In other words, we observe one variance, the variance of y, and we try to estimate two variances based on that one variance. That simply cannot be done. So that is mathematical identification.

The other context where the term identification appears in the literature is less common: causal identification. Causal identification can be understood to refer to whether our research design provides sufficient, valid evidence for a causal claim. This is something different. It is not a mathematical problem; it is a research design problem. Some operational definitions of causal identification require that for each endogenous variable you have a sufficient number of exogenous variables that are guaranteed to be valid instruments for those endogenous variables. I am not going to talk about causal identification in this video. I will focus on the concept of mathematical identification, because it is fundamental to estimating models. Causal identification is a fundamental issue of research design, which precedes any attempt to analyze our data.

How is identification discussed in the literature? Here is one way of putting it: statistical estimation and identification are two different things, and that is important to understand. Statistical estimation is about calculating the parameter values, or estimating them, from sample data. Identification concerns whether it is even possible to arrive at unique estimates, regardless of sample size. If you have a model that is not identified, collecting more data will not solve the problem. Typically in estimation, as you collect more and more data, the consistency of the estimator starts to kick in and you get fewer estimation problems. An identification problem is something that persists: if the variance of y is the only variance we observe and we try to estimate two sources, theta 1 and theta 2, then no matter how much data we collect, we cannot do it. Here is Kline's take: Kline explains that identification refers to whether it is even theoretically possible to calculate unique values for the parameters in the model, and if not, then the model is not identified.
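To make the first non-identified example concrete, here is a minimal sketch in Python (assuming NumPy is available; the simulated numbers are my own illustration, not from the video). It generates data where beta 1 plus beta 2 equals 3 and shows that every coefficient pair with that sum fits the data equally well, so the data cannot tell them apart:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=1000)
u = rng.normal(size=1000)
y = 1.0 + 3.0 * x1 + u  # data generated with beta1 + beta2 = 3

# Only the sum of the two slopes is pinned down by the data:
print(np.cov(x1, y)[0, 1] / np.var(x1, ddof=1))  # about 3

# Any split of that sum yields exactly the same residual sum of squares,
# so no amount of data can distinguish between these candidates.
for b1, b2 in [(3.0, 0.0), (1.0, 2.0), (5.0, -2.0)]:
    resid = y - (1.0 + b1 * x1 + b2 * x1)
    print(b1, b2, np.sum(resid ** 2))
```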
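The second example can be sketched the same way (again my own illustration, assuming the two unobserved sources are independent, so their variances add). Every split of the total variance produces data with the same observed variance of y, which is all we ever get to see:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# var(y) = theta1 + theta2 = 4 in every case below; only y is observed.
for theta1, theta2 in [(0.5, 3.5), (2.0, 2.0), (3.9, 0.1)]:
    y = (rng.normal(scale=np.sqrt(theta1), size=n)
         + rng.normal(scale=np.sqrt(theta2), size=n))
    print(theta1, theta2, round(y.var(), 2))  # about 4.0 every time
```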
Identification is related to degrees of freedom: if the degrees of freedom are negative, we do not have enough information to estimate the model. For example, if we observe one variance, the variance of y, and we want to estimate two variances, theta 1 and theta 2, we are trying to estimate two things from one unit of information. The degrees of freedom would be minus one, and we cannot do that. Every time the degrees of freedom, the difference between the amount of information you have and the number of parameters you want to estimate, are negative, there is no hope of estimating the model. This is a good rule of thumb to remember: when the degrees of freedom are negative, any results you may get from statistical software are completely untrustworthy.

Then we have the case of zero degrees of freedom, which is called just identified, assuming that the model is identified; I will talk about that assumption in a moment. Here we have as many parameters as things we observe. For example, if we have x and y, we can estimate one relationship between x and y because we have one covariance, but if we try to estimate a bidirectional relationship, that cannot be done; we do not have enough data. Just identified means that our model should fit the data perfectly, because there is no excess information.

An important case, which applies to most models, is over-identification. In over-identified models we have more information in the data than is required for estimation, and over-identified models can be tested. For example, a regression model with zero degrees of freedom will always have an R squared of exactly one, and the model is just identified. We cannot test whether R squared is zero in the population or not, because we have no degrees of freedom for doing that. If the degrees of freedom in a regression are positive, then we can run an F test of whether R squared is zero in the population or not. Many applications of covariance structure models, simultaneous equations models, or structural equation models rely on this over-identification, using the excess information for model testing.

Now, why is there the assumption that the model is identified? It is possible to have a model with, for example, two parts: an effect of x on m and an effect of m on y. It is possible that the first effect is over-identified and the second effect is under-identified: one part of the model, if we take that part alone and calculate its degrees of freedom, has negative degrees of freedom, another part has positive degrees of freedom, and those can add up to zero or a positive number. So the mere fact that the degrees of freedom are zero or positive does not guarantee identification. If the degrees of freedom are negative, that guarantees that we cannot estimate the model. If the degrees of freedom are positive and we have established that the model is identified, then some parts of the model, though perhaps not all parts, can be tested.
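The degrees-of-freedom bookkeeping can be written down in a few lines. This is a hypothetical helper of my own, using the standard moment count for covariance structure models: p observed variables supply p(p+1)/2 variances and covariances, and q is the number of free parameters:

```python
def model_df(p: int, q: int) -> int:
    """Degrees of freedom = observed variances/covariances minus free parameters."""
    return p * (p + 1) // 2 - q

print(model_df(1, 2))  # -1: one observed variance, two thetas -> cannot be estimated
print(model_df(2, 3))  #  0: just identified (assuming the model is identified at all)
print(model_df(4, 8))  #  2: over-identified; the excess information permits testing
```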
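The regression point can also be checked directly. In this sketch (assuming NumPy and SciPy; the r_squared helper is my own, not from the video), a line fitted to exactly two observations uses up all the degrees of freedom, so R squared is one by construction and nothing can be tested; with positive residual degrees of freedom, the F test of whether R squared is zero in the population becomes available:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def r_squared(x, y):
    """R^2 of a simple regression of y on x with an intercept."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

# Two observations, two parameters (intercept and slope): residual df = 0.
x, y = rng.normal(size=2), rng.normal(size=2)
print(r_squared(x, y))  # 1.0 (up to rounding) regardless of the data

# Twenty observations: residual df = 18, so the F test is possible.
x, y = rng.normal(size=20), rng.normal(size=20)
r2 = r_squared(x, y)
F = (r2 / 1) / ((1 - r2) / 18)
print(F, stats.f.sf(F, 1, 18))  # F statistic and its p-value
```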