Bi-factor models are an important tool for modeling scale dimensionality. They are also commonly used to address method variance issues, even though the term bi-factor model is not often used in that literature. One challenge with this kind of model is that its identification status is not clear: bi-factor models are not always identified, and they may be empirically under-identified. Unfortunately, as far as I know, there are no general rules that you can apply to establish the identification of these models. So if you need to show that your model is identified, you must either prove it yourself or rely on certain special cases that have been proven; as a general rule, there are no rules you can apply. That creates a challenge for an applied researcher.

This challenge has been acknowledged in the recent literature. A great article by Green and Yang about empirical identification of these models states that identification is one of the more challenging issues in empirical applications of structural equation modeling. It is challenging because identification is something you need to establish by working through the math. If there are no rules for identification, then you need to understand where the model-implied covariances come from and then be able to solve from those model-implied covariances to unique values of the parameter estimates. This is sometimes easier said than done. For the three-indicator factor it is simple to do, and I showed that in another video; a quick sketch of that derivation is also included below. For more complicated models, proving identification or non-identification can get very tedious. In practice, people do identification checks empirically by trying to estimate the model and seeing what happens. In this video I will talk about identification through the math.

One particular problem with these models is that they may also be empirically under-identified. It is possible that the model is identified in general, but that for specific values of the sample or population covariances, a unique solution for the parameter estimates does not exist. The idea of proving identification is that you work backwards from the model-implied covariances, or the sample covariances, to the actual coefficients or parameters, which gives you each parameter as a function of the covariances. Sometimes some of those covariances are zeros, and you may end up with a solution where a parameter value requires a division by zero. That indicates the parameter cannot be solved, because you cannot divide by zero, and that is the problem of empirical under-identification: a model can be identified for certain sets of covariances, but not for every possible set of covariances.

The bi-factor model is a model where you have minor factors. We have four minor factors here, each measured by three to five indicators, and then one general factor that loads on all indicators. This model is identified. I will not prove that; I will instead prove the identification of a simpler variant of this model. But one important thing to understand about this model is the special case where we allow the minor factors to be correlated. Would the model be identified if the minor factors were correlated? In a typical textbook example of a bi-factor model the minor factors are specified to be uncorrelated, but there is an important application of bi-factor models where these correlations must be freed.
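As a quick reminder of what this working backwards looks like, here is the standard three-indicator derivation, written in my own notation rather than copied from the video. For a single factor with unit variance and loadings \lambda_1, \lambda_2, \lambda_3, the model-implied covariances are

  \sigma_{12} = \lambda_1 \lambda_2, \quad \sigma_{13} = \lambda_1 \lambda_3, \quad \sigma_{23} = \lambda_2 \lambda_3,

which solve (up to sign) as

  \lambda_1 = \sqrt{ \sigma_{12} \sigma_{13} / \sigma_{23} },

and similarly for \lambda_2 and \lambda_3. Note the division by \sigma_{23}: if the other two indicators happen to be uncorrelated, the solution involves a division by zero, which is exactly the empirical under-identification problem just described.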
To understand the problems with that application, you first need to understand the identification of this bi-factor model where the minor factors are constrained to be uncorrelated. The important application is method variance models. Briefly, the idea of method variance is that the variation in your measures is due to the measurement method as well as the constructs of interest. This is a concern particularly in cross-sectional surveys, where a skeptic of your study could argue that the correlations between the x-indicators and the y-indicators are not driven by a correlation between the x-construct and the y-construct, but by how people tend to respond on the left-hand side of the scale versus the right-hand side of the scale, given that the x-indicators and y-indicators are measured using the same scale format, from the same informant, at the same time. That is the method variance argument. Method variance issues are commonly diagnosed, and to some extent addressed, or at least we try to diagnose and address the problem, by using this kind of model.

So here is a bi-factor model where we have one general factor that has three unique indicators, and then two minor factors that are correlated. This happens to be identified, and I will show the identification later. But then we have the unmeasured method factor model. This is a very common approach: typically researchers who have not done many surveys do a survey study, send it out for review, and the reviewer says that, by the way, you may have a method variance problem in your data, please address it. The way people address the problem is by using this kind of model, where you have one factor that explains all the indicators. This is a somewhat problematic model, and I will explain it in another video. But to understand one of the problems generally: this is basically a bi-factor model where you allow all the minor factors to be correlated. If you were to constrain these minor factors to be uncorrelated, you could not be testing your theory, because your theory is that the constructs the minor factors represent are correlated, and that is what you want to estimate and test.

Articles that discuss this modeling approach typically mention that identification of these models is challenging. Podsakoff and colleagues, for example, say that there may be potential identification problems. Unfortunately, these same articles do not really explain the identification problem in detail. They just state that the model may not be identified, without explaining why it is not identified and under which conditions it would or would not be. This is of course important to know, because if your model is not identified, your estimation will not produce any meaningful results. Podsakoff and colleagues say that the disadvantage concerns the number of indicators. The problem is related to the number of indicators, but the number of indicators itself is not the reason for non-identification; it is not that you would be running out of degrees of freedom. The identification problem with this kind of model is something a bit different. Podsakoff and colleagues also point out that people have addressed this problem by fixing some of the factor loadings to be the same. That will identify the model, but it also produces a mis-specified model unless the factor loadings really are equal in the population.
And of course, estimating a mis-specified model just to get unique estimates is a bad modeling strategy. If you estimate a mis-specified model, particularly in this case, the estimates can be really misleading. So let's take a look at why these bi-factor models are or are not identified, and under which conditions.

We start with a simple model: two minor factors with three indicators each, and one general factor. We set the scale by fixing the factor variances instead of fixing the first loadings, because that simplifies the math a bit: all the loadings are now symmetric, in the sense that the equation for solving the first loading is the same as the equation for solving the second loading, instead of having to treat the first indicator as a special case because its loading is constrained to one. This gives some symmetry to our analysis, but this scale setting is not required. You could just as well fix the first loadings and prove the identification status of the model; the equations would simply not be as nice to work with.

In this particular scenario we have 21 sample covariances and 18 estimated parameters, which gives us three degrees of freedom. So this model, if it is identified, is over-identified. We know that there are some constraints on the data, but positive degrees of freedom do not guarantee identification. So we ask: is this identified, and how do we know? For identification we know that the latent variables must have scales, which they do, and we know that the three-indicator factor is identified, but that rule does not really apply to the bi-factor model: it applies to a set of indicators that load on only one factor. So how do we know whether this is identified?

We have to work through the covariances. These are all the model-implied covariances for this model, and we can see that there are three groups: the variances of the indicators, the covariances between indicators of the same minor factor, and the covariances between indicators of different minor factors. We do not actually need the first set, the variances, because they are only used to solve the error variances: if we can solve the loadings, we can easily solve the error variances by plugging the loadings in and rearranging. These variance equations are not useful for solving the factor loadings either, because we do not know the error variances, so they are uninformative about the loadings. We will therefore just look at the remaining subset of equations to simplify the problem.

How do we show whether the factor loadings are identified? We can just start solving this set of equations by substitution. This is high-school math; it gets a bit tedious because the number of equations is large, but it is doable. So let's just do it. We take the first equation here, which is a convenient place to start because it is simple: it has only two terms, compared to the other equations that have more. So we start with the simple equations. We solve lambda g1, which is sigma 14 divided by lambda g4, and then we substitute this solved value of lambda g1 into the other equations. Now we have a set of equations where lambda g1 is eliminated, and we have one less equation.
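For reference, the set of equations being worked with can be written like this (my notation, not a verbatim copy of the slide): with factor variances fixed to one, uncorrelated factors, \lambda_{g1}, ..., \lambda_{g6} for the general-factor loadings, \lambda_{m1}, ..., \lambda_{m6} for the loadings on each indicator's own minor factor, and \theta_i for the error variances,

  \sigma_{ii} = \lambda_{gi}^2 + \lambda_{mi}^2 + \theta_i   (variances; used only for the error variances)
  \sigma_{ij} = \lambda_{gi} \lambda_{gj} + \lambda_{mi} \lambda_{mj}   (i \neq j, indicators of the same minor factor)
  \sigma_{ij} = \lambda_{gi} \lambda_{gj}   (i \in \{1,2,3\}, j \in \{4,5,6\}, indicators of different minor factors)

The first substitution step uses the between-factor equation \sigma_{14} = \lambda_{g1} \lambda_{g4}, which gives \lambda_{g1} = \sigma_{14} / \lambda_{g4}.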
So we eliminate equations and eliminate unknowns until we have just one equation with one unknown, then we solve that unknown, and then we start plugging the solved parameter values back in until we have a solution for every parameter. That is lambda g1 solved. We then take lambda g2, solve it, and substitute back. Then we take lambda g3, solve it, and substitute. We have now eliminated the loadings of the general factor on the first three indicators, the indicators of the first minor factor m1. We can then continue with the loadings of the second block: we solve lambda g5 in terms of lambda g4 and substitute back, and this starts to get a bit tedious, but we can now solve lambda g6 as well and substitute that. Now we have a set of equations that contain only lambda g4: four equations with one unknown. So that is over-identified, if it is identified at all.

Let's simplify these equations a bit. After simplification we can see that lambda g4 actually disappears. The reason it disappears is basically that we have something multiplied by lambda g4 and then divided by lambda g4, and when you multiply and divide by the same value, that value cancels out. So now what? These are actually constraints: if this model is correct for the population, then these constraints should hold, so they are over-identification constraints. Now we have a problem, because we have four constraints but the degrees of freedom are three. If the degrees of freedom are three and the model is identified, then there should be three constraints, but now we have four. How is it possible to have four constraints but only three degrees of freedom? It is possible because we actually have six equations with seven unknowns left: the six within-factor covariance equations involve the six minor-factor loadings plus the still-unsolved lambda g4. These constraints do not give us any information about those unknowns, so we have six equations with seven unknowns that cannot be solved, which means that this model is not identified. We have just proven the non-identification of this model.

This is the way you can prove the identification or non-identification status of your model. Quite often, if you have a hunch that the model may not be identified, then proving that one part of the model is not identified is sufficient to prove that the model as a whole is not identified. Some parameters can still be identified, but working with these partially identified models is something I would not recommend, because it requires a lot of expertise to understand the results. You will get warnings from your software, and figuring out whether those warnings apply to the part you know to be under-identified or to the part you think is identified is challenging.

So what happens if we try to estimate this model using SEM software? I am using R and the HolzingerSwineford1939 dataset that comes with the lavaan package, and I am estimating this model: I take the visual and textual three-indicator factors from the lavaan example and add a general factor (a sketch of this in lavaan follows below). We can see that the software warns us that the model is potentially not identified, we get missing standard errors, which is a symptom of non-identification, and we know that these estimates should not be trusted. So what do we do after that? If you want to estimate this kind of model, but you do not get standard errors and you get this warning, what do you do?
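Before getting to that, here is a minimal lavaan sketch of the estimation just described, using the HolzingerSwineford1939 data that ship with lavaan. This is my reconstruction, so the exact syntax used in the video may differ.

  library(lavaan)

  # Six-indicator bi-factor model: two three-indicator minor factors (the visual and
  # textual items of HolzingerSwineford1939) plus a general factor on all six items.
  model <- '
    visual  =~ x1 + x2 + x3
    textual =~ x4 + x5 + x6
    g       =~ x1 + x2 + x3 + x4 + x5 + x6
  '

  # std.lv = TRUE fixes the factor variances to one (the scale setting used above);
  # orthogonal = TRUE keeps all factors uncorrelated.
  fit <- cfa(model, data = HolzingerSwineford1939, std.lv = TRUE, orthogonal = TRUE)

  # As described above, lavaan warns that the model may not be identified and
  # some standard errors cannot be computed.
  summary(fit)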
Well, you first have to understand the identification problem that I just explained, and then make an informed decision on whether you want to fix some of the loadings, or whether you want to, for example, try adding more variables to the model, which is not always doable.

Let's try to understand why this model is not identified. This is the sample covariance, or model-implied covariance, matrix for the model: the parameters form the implied covariances of X1 through X6. What information do the between-factor, between-block covariances give us? We can see from the covariances of X1 with X4, X5 and X6 that they give us the relative magnitudes of lambda g4, lambda g5 and lambda g6. If the X1, X4 covariance is twice as large as the X1, X6 covariance, then we know that lambda g4 is twice as large as lambda g6. The same works for X4: the covariances of X4 with X1, X2 and X3 give us the relative magnitudes of lambda g1, lambda g2 and lambda g3. The remaining four between-block covariances are actually redundant: we could just as well have looked at the covariances of X2 or X3 with the other block instead of X1. So once we know the proportionality within lambda g1, g2 and g3 and the proportionality within lambda g4, g5 and g6, these four covariances are just testable constraints.

The problem is that the relative magnitude of the set lambda g1, g2, g3 versus the set lambda g4, g5, g6 is not known. We know the relative magnitudes within lambda g1, g2 and g3, but we do not know which is larger, lambda g1 or lambda g4. It is possible that the g1, g2 and g3 loadings are very large and the g4, g5 and g6 loadings are very small, or vice versa; we cannot empirically determine which set of loadings is larger. If X3 and X4 are only weakly correlated, it is possible that X3 loads highly on the g-factor and X4 only weakly, or that X3 loads weakly and X4 strongly. We do not know which it is, and that is the identification problem. The covariances within the minor factor one indicators and within the minor factor two indicators do not help either, because we have 12 variances and covariances and 12 parameters there: each three-indicator factor has six parameters, three loadings and three error variances, so with two factors that is 12 parameters, and we have already consumed those 12 units of information in estimating the unknowns for these factors. There is no excess information left that we could use for pinning down the g-factor loadings.

Let's take a look at a slightly more complicated model and see whether it is identified and why. What if we have an eight-indicator model: the first minor factor with four indicators, the second minor factor with another four indicators, and a general factor loading on all the indicators. Is this identified? We again need to establish a scale first, and we again fix the factor variances. We have 36 sample covariances and 24 estimated parameters (eight general-factor loadings, eight minor-factor loadings and eight error variances), which gives us 12 degrees of freedom. So if this model is identified, it is over-identified, but is it identified or not? In this particular case it is useful, for establishing the identification status of the model, to look at a subset of the model.
This is a large model: we have 36 sample covariances, and trying to type all the corresponding equations on a slide would fill the slide. But we do not have to do that; we can take a smaller part of the model and establish the identification status of that part. So let's look at the first five indicators only. We are ignoring the information in the remaining indicators; those indicators provide information about the m2 loadings and the related variances, but we are not interested in proving that part. We just want to prove whether the loadings of the first minor factor are identified. If those loadings are identified, then the loadings of the second minor factor are identified by symmetry, and if those factor loadings are identified, then the general-factor loadings are identified, and so on. If we can prove the identification of one part of the model, that typically helps in proving the identification of another part of the model.

So we are just looking at these five indicators. The reason we can drop the other indicators from the model is that they have no observed consequences for these five: all the causes of X1 through X5 are included, and that is enough for us to prove identification. We cannot prove the identification of the variance components of X5, because minor factor two is now a unique source of variance for X5, and X5 also has its error term e5 as a unique source of variance; based on these five indicators we cannot tell whether the variance of X5 not explained by the g-factor is due to minor factor two or due to the error variance. But we do not need to. We just look at the identification of these loadings.

So let's look at the identification of this submodel. If we only work with the first four indicators, these are the model-implied covariances for them, and this would not be identified. The reason is that this is basically an exploratory factor analysis model: we have two factors that both explain all the indicators, and from the exploratory factor analysis literature we know that such models are not identified unless we establish a rotation criterion. So this is like an exploratory factor analysis without a rotation criterion, and that is not identified. If we add the X5 indicator, we add four more covariances to the model. Instead of proving identification directly from these equations, which would be doable, it is easier to consider the logic of why this is identified: adding the indicator X5 establishes a rotation criterion for this factor analysis model, namely that the loading of m1 on X5 must be zero. Fixing that loading to zero pins down the rotation and identifies the model (the loading matrix sketched below illustrates this). This argument is easier to follow if you understand how exploratory factor analysis can be conducted within a confirmatory factor analysis framework; for example, a page on Stata's website explains the principles and shows an example using Stata, so if you use that software, this is a useful exercise to work through. A model with three or more just-identified minor factors would also be identified. I will not prove that, but it basically follows from the identification of three-indicator factors.
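To visualize the rotation-criterion argument, the loading matrix of this five-indicator submodel can be written as follows (my notation; the columns are the general factor and the first minor factor):

  \Lambda = \begin{pmatrix}
    \lambda_{g1} & \lambda_{m1} \\
    \lambda_{g2} & \lambda_{m2} \\
    \lambda_{g3} & \lambda_{m3} \\
    \lambda_{g4} & \lambda_{m4} \\
    \lambda_{g5} & 0
  \end{pmatrix}

The single zero in the last row is the one restriction that an unrestricted two-factor exploratory model of these five indicators lacks, and it is what pins down the rotation and identifies the loadings.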
So the identification problem with the g-factor in this model is basically the same as the identification problem of a two-indicator factor. For a two-indicator factor you know the product of the two loadings, but you do not know which one is larger, and that is the situation with the g-factor loadings here when we have just two three-indicator minor factors. But if we have three or more minor factors, then the model would be identified, following the same logic as the three-indicator factor rule.

Okay, so that covers identification in general, without any consideration of empirical identification. Let's look at another example. This is the model we just worked with: eight indicators, two minor factors, one general factor, and we just proved that it is identified. This model is discussed by Green and Yang, who discuss the concept of empirical under-identification. Their article has an example like this: a factor loading matrix where all indicators load on a general factor and each indicator loads on one, and only one, minor factor. All the general-factor loadings are equal and all the minor-factor loadings are equal. What happens now is that we have just three unique equations with four unknowns. How is that possible? We have quite a few covariances here, but only three distinct equations (written out in symbols below). The reason is that these covariances are actually all the same: all the variances are the same, because the factor loadings and error variances are constrained to be the same; all the covariances within a minor factor are the same; and all the covariances between minor factors are the same. So we have just three equations but four unknowns, and we cannot solve four things from three things, which makes this not identified. More precisely, it is empirically under-identified.

In a factor model context, empirical under-identification occurs, for example, if a two-indicator factor is uncorrelated with all other factors, or if, in a three-indicator factor, two of the indicators are uncorrelated. But this example shows that empirical under-identification can also occur in models where the relevant covariances are non-zero. In normal factor models, zero covariances are the problematic case; in bi-factor models, non-zero covariances can be problematic as well. So the fact that all your indicators are highly correlated does not mean that your model is empirically identified, even if it is identified in general.

So what is the identification condition here? Identification requires that the loadings differ, and this is a bit of a problem, because quite often when we design survey items, for example, we want the items to be interchangeable: each item should perform roughly equally well, and if that is the case, we run into this identification problem. So what do we do about it? There are some options, but one thing to understand about the problem generally is that this empirical under-identification can also occur if you constrain the loadings to be equal.
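In symbols, with unit-variance uncorrelated factors, a common general-factor loading \lambda_g, a common minor-factor loading \lambda_m and a common error variance \theta (my notation for the equal-loadings example just described), the implied covariance matrix has only three distinct entries:

  Var(x_i) = \lambda_g^2 + \lambda_m^2 + \theta
  Cov(x_i, x_j) = \lambda_g^2 + \lambda_m^2   (i \neq j, same minor factor)
  Cov(x_i, x_j) = \lambda_g^2   (different minor factors)

A freely estimated bi-factor model therefore has only these three numbers to work from, which is not enough to recover all of the individual loadings uniquely.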
So even if the loadings are different in the population, and therefore different in the sample, so that you do not have this kind of uniform covariance pattern, constraining the loadings to be the same makes the model empirically under-identified. This is somewhat counterintuitive, because normally when we constrain things in a model, the model becomes more strongly identified, but in this case constraining the model can actually make it empirically under-identified. Understanding why this under-identification occurs is critically important if you work with this kind of model, and particularly if you follow the recommendations of fixing all the method factor loadings to be equal in these method-factor designs, because that can create problems you did not have before.

So what are the solutions to this identification issue? There is a recent article by Eid and co-authors in Psychological Methods that discusses different strategies for identifying this model, focusing on two identification techniques. One is to have indicators of the general factor that are not associated with any minor factor: three such indicators are sufficient to identify the general factor because of the three-indicator rule, and once the general factor is identified, the identification of the remainder of the model is easy to show. Another way is to have at least one indicator that is constrained to be unrelated to all the other factors, which establishes a rotation criterion for the factor model; that is another solution.

Whether these are applicable to your scenario depends, of course, on your theory. Is it theoretically feasible to say that there is a general factor that influences some indicators and that there are no other influences on those indicators? For example, if those three indicators were measures of social desirability and you had two scales that you suspect could be affected by social desirability bias, then this kind of model would make sense: social desirability is measured by three indicators, and social desirability also affects the other indicators. Embedding the bi-factor model in a larger factor analysis model also helps with identification, but that is not always possible. If you use minor factors and bi-factor models as a diagnostic tool within a larger model, then typically the model will be identified, because your general factor applies to only one scale and not to every indicator in your analysis. But if you model a general factor as a cause of every possible indicator in your model, then identification issues are surely a concern for you.
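To make the first of these strategies concrete, here is a minimal lavaan-style sketch. All variable and factor names are hypothetical, and the exact specification in the article may differ: the general factor has three indicators of its own in addition to loading on the substantive items, it is kept uncorrelated with the substantive factors, and the two substantive factors are allowed to correlate.

  library(lavaan)

  # sd1-sd3: indicators that load only on the general factor (e.g., a social desirability scale)
  # x1-x3 and x4-x6: indicators of the two substantive (minor) factors
  model <- '
    f1 =~ x1 + x2 + x3
    f2 =~ x4 + x5 + x6
    g  =~ sd1 + sd2 + sd3 + x1 + x2 + x3 + x4 + x5 + x6
    g  ~~ 0*f1        # general factor uncorrelated with the substantive factors
    g  ~~ 0*f2
    f1 ~~ f2          # the substantive factors may correlate
  '
  # fit <- cfa(model, data = mydata, std.lv = TRUE)   # mydata is a hypothetical data frame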