Confirmatory factor analysis models can be useful for addressing method variance concerns. To use them effectively, however, we need to consider identification, and we need to interpret the results properly. In this video I'll talk about identification of these models.

The kinds of models we're talking about are method variance models: indicators load on the factors that represent the constructs of interest, and also on a source of method variance. Whether a single source of method variance is realistic is a question I'll return to at the end of the video, but for now let's assume that this kind of model is useful. There are a few versions of this model. In one, the source of method variance is not measured. In another, the source of method variance is measured directly; for example, we measure social desirability and model it as a cause of the variation in the items. A third option is to use marker indicators.

Articles about these models, particularly the unmeasured latent method factor model, note that there are potential identification problems, but these articles don't really explain what the identification problem is about. Researchers have noted that when using this model, the software tends to produce warnings, fail to produce standard errors, or give other indications of non-identification, but the identification status of these models has not been addressed well in the literature. So let's take a look at the identification of this kind of model.

This is the unmeasured latent method factor model. In our example we have three constructs that influence the indicators, plus a source of method variance, and we estimate a model that uses latent variables to represent the constructs and a single factor to represent the source of method variance. There's another variant, the marker variable model. Here we measure a construct that is theoretically unrelated to the focal constructs: construct C is the marker construct, and its indicators are the marker variables. The idea behind the marker variable is that if a construct is assumed to be uncorrelated with the key constructs in the study, A and B here, then the only reason for correlations between the items of the interesting constructs and the marker items is a shared source of method variance. Finally, we have the measured latent method factor model. It differs from the marker variable model in that X7, X8 and X9 would be, for example, measures of social desirability, and the method factor would represent social desirability as the source of method variance. The measurement artifact is measured directly instead of being proxied by a marker variable, so these are method variance indicators. This model is identified; it is what Eid and colleagues call a bifactor S−1 model, and its identification is explained in that article, so we don't need to discuss it in detail in this video. We'll focus on the model with and without these factor correlations. If we free the correlations, we have an unmeasured method factor model.
If we constrain these correlations to be zero, then construct C is a marker construct, and we have a marker variable model. Typically, researchers fix the method factor loadings to be the same to avoid identification problems, and these identification problems are rarely explained in the articles that apply this technique. Whenever you have an identification problem, you should explain what specifically is not identified in the model, instead of just noting that you got a warning and fixing all the loadings to one. All such decisions should be justified rather than made by convention, or made just to get the warning to go away, because there are different ways to address identification, and this one is probably not ideal, at least not for every possible model.

So is this model identified? This is the least constrained form: the scales are set by fixing the first indicator of each factor to load at one, the same is done for the method factor, and the method factor loadings are otherwise unconstrained. We could argue that this is identified because it resembles an exploratory factor analysis: it is an exploratory factor analysis with four factors and a rotation criterion. A is rotated so that it has no effect on X4 through X9, B is rotated so that it has no effect on X1 through X3, and the method factor is rotated to be uncorrelated with the other factors. We could use this reasoning to claim that the model is identified. Whether that reasoning is correct, I don't know; it sounds plausible, but if someone gave me that argument, I would ask for more evidence.

Let's look at the identification status of this model from a different perspective: identification can be proven. Here are the covariance equations; I've omitted the variance equations for simplicity because they are only needed to solve for the error variances. Can we solve all the model parameters from this set of equations, assuming that we know all the population covariances? That is the question of identification. We have 36 covariances and 21 distinct parameters, so 15 degrees of freedom. Is this identified or not? We could start solving this set of equations, but it gets very tedious; it would probably take me several days to figure out whether this is identified, and I'm not sure I would even succeed. If proving identification goes beyond your skills, there are other strategies you can apply. In my video on identification of structural models of observed variables, I note that there are empirical ways of checking identification. We'll use that empirical checking strategy here to show that this model is identified, but also to show that this is not the full story about the identification of this model.

We start with the empirical strategy: we simply estimate the model. I will use the Holzinger and Swineford data in R; I have visual, textual, and speed factors, each measured with three indicators, and an unmeasured method factor on which all the indicators load (see the sketch below). We get a warning. Is that an indication of non-identification?
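A minimal lavaan sketch of this setup, assuming the HolzingerSwineford1939 dataset that ships with lavaan (indicators x1 through x9); the exact syntax used in the video may differ:

```r
library(lavaan)

# Unmeasured latent method factor model: three substantive factors plus a
# single method factor on which all nine indicators load. The method factor
# is constrained to be uncorrelated with the substantive factors.
model <- '
  visual  =~ x1 + x2 + x3
  textual =~ x4 + x5 + x6
  speed   =~ x7 + x8 + x9
  method  =~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9
  method ~~ 0*visual
  method ~~ 0*textual
  method ~~ 0*speed
'

fit <- cfa(model, data = HolzingerSwineford1939)
summary(fit)   # with this specification the run typically ends with a warning
```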
A warning is something you always need to take seriously: you need to understand its source, and in this case we need to look at the coefficients. We can see that in the first factor, the loading of X3 diverges toward a very large number. When one of the factor loadings diverges like that, using that indicator as the scaling indicator for the factor may be a good idea. Why becomes clear when we switch the model to use indicator three as the scaling indicator: we free the loading of the first indicator by setting it to NA, and we constrain the loading of the third indicator to one (see the sketch below). Now we estimate, and there is a warning about a Heywood case, so we need to check where the negative variance estimate is and how large it is. But we got convergence and a chi-square statistic that is close to non-significant, so pretty good.

When we look at the coefficient estimates, we see that we have standard errors, which is good; identified models should produce standard errors, and missing standard errors or warnings are signs of non-identification. There are no extremely large estimates; everything is in the same ballpark, which is often a good sign. But the loadings look strange: the loading of X1 is negative, and that is why the original model did not converge, because we had constrained that loading to be positive by fixing it to one. When you fix a factor loading to one, you fix not only its magnitude but also its sign; the magnitude sets the variance of the latent variable, but the sign can be wrong. Here X1 wants a negative loading, and forcing it to be positive produced a misspecified model that did not converge. If we had fixed the loading of X1 to minus one instead, the model would have converged as well.

Still, these results look a bit odd. We are saying that X1, X2 and X3 should load on the same factor, yet X2 does not load at all, even though, based on existing theory and prior validation of the scale, we would expect both X1 and X2 to load positively. That is a cause for concern, and if I saw this kind of result, I would ask the authors to address the empirical identification of the model, because these estimates do not look correct. Then there is the Heywood case: the negative variance estimate is small and not statistically significant, so it is possible that the error variance of X3 is simply very small and this is just a result of sampling error; that is not something we need to be concerned about. If the Heywood case were large, a highly significant negative variance, that would indicate model misspecification. So on identification, we passed the first check: we got some weird results, but no errors. Then there are other checks we can use, starting with different starting values.
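A sketch of the respecification, continuing the hypothetical lavaan syntax above: the loading of x1 is freed with NA and x3 becomes the scaling indicator of the visual factor.

```r
# Respecified model: x3 is the scaling indicator of the visual factor,
# and the loading of x1 is freely estimated.
model2 <- '
  visual  =~ NA*x1 + x2 + 1*x3
  textual =~ x4 + x5 + x6
  speed   =~ x7 + x8 + x9
  method  =~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9
  method ~~ 0*visual
  method ~~ 0*textual
  method ~~ 0*speed
'

fit2 <- cfa(model2, data = HolzingerSwineford1939)
summary(fit2)   # converges; inspect any Heywood case (negative error variance)
```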
Non-identification basically means that the solution to the model is not unique, and which solution we arrive at depends on the starting values. If we try multiple different sets of starting values and get the same estimates every time, that indicates that the model is probably identified. So we use different starting values; I'm using 0.5 and 1.5 in various combinations, values that are reasonably close to one but still vary a lot (sketched below). We estimate and get exactly the same results; the absolute differences show up around the sixth decimal. Different starting values giving the same result indicates that the model is probably identified. Of course, this check should be repeated with several different sets of starting values to make sure the optimizer is not simply landing on the same local optimum.

Then we can do something else: estimate using the model-implied covariance matrix. We take the implied covariance matrix from the first model and estimate the model again from that matrix. Comparing the estimates from the implied matrix with the estimates from the observed data, we see that they are essentially the same; they differ in the third decimal, which is not a big deal. The one place where the two sets of results do differ is model fit: the chi-square is now exactly zero, because the implied covariance matrix is, by construction, perfectly consistent with the model, so there is no misfit. We are estimating the same model except without any misfit, and the fact that the estimates are the same shows that they are unique given this covariance matrix. Getting the same estimates from the implied matrix is another indication of identification.

Then we use strategy four: estimate using simulated population data. I'm not simulating a data set; I'm computing a population covariance matrix from a set of parameter matrices. I create a lambda matrix of factor loadings, where the loadings are 1, 1.2 and 1.4 for each of the substantive factors and run from 1 to 1.8 for the method factor. In the psi matrix, the factor correlation matrix, everything is correlated at 0.3 except the method factor, which is uncorrelated with the other factors. The error variances are all ones and the errors are uncorrelated. We get sigma, the population covariance matrix, by multiplying these matrices together and adding the error variances, and after setting the row and column names this is our population covariance matrix. When we analyze this population covariance matrix, we should recover factor loadings of 1, 1.2 and 1.4 and factor correlations of 0.3. We estimate using this population covariance matrix as if it were a sample covariance matrix, and we get the correct estimates. Recovering the correct values indicates that our estimator is consistent, which of course requires identification.

So, all good. Did we just prove that the identification concerns mentioned in many articles are not a concern at all? Not really; there is more to this story. This model is identified, but the identification problems actually concern empirical identification: the model is identified for some values of the population covariances, but not for every possible set of values. That is the problem of empirical under-identification.
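A sketch of these checks in R, continuing from the hypothetical model2 and fit2 above, with assumed parameter values that match the description (lavaan's start() modifier for starting values, fitted() for the implied covariance matrix):

```r
# Check 2: different starting values for some loadings; an identified model
# should return (numerically) the same solution.
model2_alt <- '
  visual  =~ NA*x1 + start(0.5)*x2 + 1*x3
  textual =~ x4 + start(1.5)*x5 + x6
  speed   =~ x7 + start(0.5)*x8 + x9
  method  =~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9
  method ~~ 0*visual
  method ~~ 0*textual
  method ~~ 0*speed
'
fit2_alt <- cfa(model2_alt, data = HolzingerSwineford1939)
max(abs(coef(fit2) - coef(fit2_alt)))        # should be essentially zero

# Check 3: re-estimate from the model-implied covariance matrix; the estimates
# should be unchanged and the chi-square exactly zero.
Sigma_implied <- fitted(fit2)$cov
fit_implied <- cfa(model2, sample.cov = Sigma_implied, sample.nobs = 301)
round(coef(fit2) - coef(fit_implied), 3)
fitMeasures(fit_implied, "chisq")            # 0: no misfit by construction

# Check 4: build a population covariance matrix from assumed parameter values
# (loadings 1, 1.2, 1.4 per factor; method loadings 1 to 1.8; factor
# correlations .3; unit error variances) and check that estimation recovers them.
Lambda <- cbind(A = c(1, 1.2, 1.4, rep(0, 6)),
                B = c(rep(0, 3), 1, 1.2, 1.4, rep(0, 3)),
                C = c(rep(0, 6), 1, 1.2, 1.4),
                M = seq(1, 1.8, by = 0.1))
Psi <- matrix(c( 1, .3, .3, 0,
                .3,  1, .3, 0,
                .3, .3,  1, 0,
                 0,  0,  0, 1), 4, 4)
Theta <- diag(9)
Sigma_pop <- Lambda %*% Psi %*% t(Lambda) + Theta
colnames(Sigma_pop) <- rownames(Sigma_pop) <- paste0("x", 1:9)

pop_model <- '
  A =~ x1 + x2 + x3
  B =~ x4 + x5 + x6
  C =~ x7 + x8 + x9
  M =~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9
  M ~~ 0*A
  M ~~ 0*B
  M ~~ 0*C
'
fit_pop <- cfa(pop_model, sample.cov = Sigma_pop, sample.nobs = 1000)
summary(fit_pop)   # loadings 1, 1.2, 1.4 and correlations of .3 are recovered
```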
Now let's see what happens when we try yet another set of starting values; this time I'm using 1.5 and 0.5 in a slightly different configuration. We get a warning, a Heywood case, which by itself is nothing to be concerned about at this point. But when we compare the estimates, they are not the same. Some are close to one another, for example 0.86 versus 0.807, but 0.475 versus 0.29 is not even close, and neither is 0.493 versus 0.641. Different starting values give us different sets of estimates, and that is a big problem: it tells us that the solution we got is sensitive to the choice of starting values. Starting values are chosen by algorithms, but the choice is more or less arbitrary, and if two runs fit about equally well but give different solutions, we have no guarantee that we have a unique set of estimates. This third model fits nearly as well as the first two; it is basically a failure of optimization, in that the optimizer stopped at a solution it considered best, while the original solution was actually slightly better. Of course, the only way we know that a better solution exists is trial and error. Importantly, the third model produced no warnings. If we had estimated only that model, we would have gotten different results than the first two, with slightly worse fit, and we would have no indication that anything was wrong.

The reason this happens is that this model is nearly empirically non-identified: the covariances are close to values at which the estimates cannot be computed at all. To understand the empirical identification of this model, you need to understand the empirical identification of bifactor models. So let's simulate from a different set of population values (sketched below). Assume all the factor loadings are equal, and use the same psi matrix as before: the substantive factors are correlated at 0.3, the method factor is uncorrelated with them, all factor variances are one, and all error variances are one. We calculate sigma, the population covariance matrix, the same way as before, and estimate from it. We get an error, and when we look at the estimates, they are incorrect, even though we estimated from the population matrix. We had the full population covariance matrix, and we know that in this population the factors are correlated at 0.3, because that is how we created the data, yet we get incorrect results. This means the estimates are inconsistent: if you cannot get the right answer from the population, your estimator is inconsistent.

So why does this happen, and under which conditions is the model not identified? The empirical identification problem here is that the population covariance matrix contains only three unique values: the between-factor covariance (1.3), the within-factor covariance (2.0), and the variance (3.0). We are trying to estimate 30 different parameters from just three unique values, and that cannot be done. This is a big problem. Identification requires that the loadings differ; if the loadings are all the same, the model is not empirically identified.
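A sketch of this tau-equivalent population, reusing the hypothetical pop_model syntax and the Psi and Theta matrices defined above and assuming all loadings equal one:

```r
# Equal loadings everywhere: the population covariance matrix has only three
# distinct values, and the unmeasured method factor model is no longer
# empirically identified.
Lambda_eq <- cbind(A = c(rep(1, 3), rep(0, 6)),
                   B = c(rep(0, 3), rep(1, 3), rep(0, 3)),
                   C = c(rep(0, 6), rep(1, 3)),
                   M = rep(1, 9))
Sigma_eq <- Lambda_eq %*% Psi %*% t(Lambda_eq) + Theta
colnames(Sigma_eq) <- rownames(Sigma_eq) <- paste0("x", 1:9)
unique(as.vector(Sigma_eq))   # only 3.0, 2.0 and 1.3

fit_eq <- cfa(pop_model, sample.cov = Sigma_eq, sample.nobs = 1000)
summary(fit_eq)   # identification warning; the .3 correlations are not recovered
```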
Estimating from a sample shows us something interesting about this problem. When we estimate from the population, we get a warning that the model is not identified. But when we estimate from a sample, something different happens. I generate multivariate normal random data, using seed 12345 and 200 observations, and estimate from that sample (sketched below). There are no warnings; everything went fine. The model fits well, we get estimates, we have standard errors. The problem is that the estimates are not close to the correct values, and the standard errors are large. So we have a model where, given the full population, we cannot tell how strongly the factors are correlated; yet given a sample from that same population, we get no warnings. The estimator is inconsistent, because it cannot recover the truth from the full population, but when we use a sample, there is no indication of the problem. Think about how often a researcher would notice that there is a problem at the population level when the sample gives perfectly normal-looking estimates that are nevertheless incorrect; after all, you would not know what the correct population values are.

Why is the empirical identification of this model a big problem? Because when we design survey instruments, we typically try to make all the questions about equally good. We like to think of the items as interchangeable; we aim for tau-equivalence, meaning the items share the same true score, which implies equal loadings. So a really good scale development study should produce exactly the kind of scale that is empirically under-identified under this model. To get this model to work, you would have to design your scales so that some items are bad by design, and that does not sound like a reasonable approach. Populations that are empirically under-identified, or nearly so, are therefore probably very common when using these kinds of models. In sample data, sampling variation can provide identification, but that is artificial identification; it does not really mean anything. Random noise allows you to compute estimates, but it does not allow you to say anything about the population values. If the population covariances do not identify the parameters, then adding sampling error on top should not let you say anything either. Nevertheless, the software gives you a result without warnings; you just cannot trust it, because the model happens to be identified only because of sampling error in the data. You get no warnings and potentially misleading results, which is a big problem. Also, empirically under-identified models do not always converge, and this is what the literature refers to as convergence problems with these models. So what do researchers actually do when their model does not converge? The typical fix is to constrain the method factor loadings all to one. But this is problematic, because the identification problem is not whether we can identify the relative magnitudes of the method factor loadings.
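A sketch of the sample-based check, assuming the Sigma_eq population matrix built above and using MASS for the random draw:

```r
library(MASS)

# Draw a finite sample from the empirically under-identified population and
# refit the same model: sampling error alone makes the model "estimable".
set.seed(12345)
samp <- as.data.frame(mvrnorm(n = 200, mu = rep(0, 9), Sigma = Sigma_eq))
colnames(samp) <- paste0("x", 1:9)

fit_samp <- cfa(pop_model, data = samp)
summary(fit_samp)   # typically no warnings, yet the estimates are far from the
                    # population values and the standard errors are very large
```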
The identification problem is that we cannot know whether the items are correlated because of the method factor or because they measure constructs that are correlated. So fixing the method factor loadings to be the same, while it can accomplish identification, also produces a misspecified model, and it addresses the wrong problem: it does not address the fact that we cannot say whether a high correlation is due to a method effect or due to highly correlated constructs. Finally, there is not much research on this issue. I tried to find articles that discuss the identification of these models, and beyond articles stating that there are identification problems without explaining what they are, I could not find any. This is something someone should probably look into at some point. The bottom line is that results from unmeasured method factor models should probably not be trusted. Even if you get a solution without a warning, that does not guarantee that there is no empirical under-identification in the population, and if the population model is not identified, then you cannot say anything using the sample data, regardless of what your software gives you. These kinds of models should probably be avoided.

Now let's take a look at the marker variable model and its identification. This is a more defensible model. Here is the marker construct with its marker indicators, and the marker construct, or marker factor, is constrained to be uncorrelated with the interesting factors. We can actually prove the identification of this model, and doing so is much more doable because the covariance equations are a lot simpler: indicators 1 and 7, for example, are correlated only because of the method (see the sketch below). We have 36 covariances and 19 parameters, so 17 degrees of freedom. The proof strategy is: first solve ψ_MM, the variance of the method factor; that allows us to solve the method factor loadings, and once those are solved, we can solve the remaining method factor loadings for the markers. Once the method factor is fully solved, we can take whatever covariance remains after the method factor has been partialled out and estimate the factor model of the interesting constructs from those residuals, which completes the identification of the model. So if we can solve the variance of the method factor, we can essentially prove that this model is identified.

I'm first going to look at these equations here and substitute them into these equations here, so we start eliminating equations. Now let's focus on this set of equations: the covariances between the interesting indicators and the marker indicators. We have eliminated the covariances among the marker indicators, and what remains are the covariances between the interesting indicators and the markers. We have 15 equations and 5 unknowns, the method factor loadings, which leaves 10 degrees of freedom, so we have over-identifying constraints. We can solve the method factor loadings from these equations, and the remaining covariances give us over-identification tests.
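A sketch of why the cross-covariances isolate the method factor, using assumed notation (λ for loadings, ψ for factor variances and covariances) rather than the exact symbols shown in the video:

```latex
% Substantive indicators (those of construct A) and marker indicators:
x_i = \lambda_{Ai} A + \lambda_{Mi} M + \varepsilon_i, \quad i = 1,2,3
\qquad
x_j = \lambda_{Cj} C + \lambda_{Mj} M + \varepsilon_j, \quad j = 7,8,9
% With the marker constraint \psi_{AC} = \psi_{BC} = 0 and M uncorrelated
% with A, B, and C, the covariance between a substantive indicator and a
% marker indicator is
\sigma_{ij} = \lambda_{Ai}\lambda_{Cj}\,\psi_{AC}
            + \lambda_{Mi}\lambda_{Mj}\,\psi_{MM}
            = \lambda_{Mi}\lambda_{Mj}\,\psi_{MM}.
```

So these cross-covariances depend only on the method factor loadings and ψ_MM, which is the entry point for the solution strategy described above.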
We can then plug in the solved method factor loadings here, which gives constraints that should hold in the data, and we can check whether they do. All right, so we can set those constraints aside. Next we substitute these equations into the left-hand sides, which gives us a few more equations, and now we try to solve the variance of the method factor from the remaining equations. I'm not actually going to do that, because it gets tedious, but I'll explain the strategy. We choose four equations; I'm choosing σ14, σ15, σ24 and σ25. That gives four equations in four unknowns: ψ_AB, λ_A2, λ_B5 and ψ_MM. We first solve ψ_AB as a function of ψ_MM. Using the solved ψ_AB, we then solve λ_B5 as a function of ψ_MM, then λ_A2 as a function of ψ_MM, and finally we plug the solutions for ψ_AB, λ_A2 and λ_B5 into the last equation. That produces a third-degree polynomial in ψ_MM, and third-degree polynomials always have at least one real solution, which establishes identification. Actually working through the polynomial gets very tedious, so we will not do that here. So this model is identified, but is it empirically identified?

What is the empirical identification problem in this model? We face the same problem as before. As before, the factors have equal loadings, but now the third factor is a marker: it is uncorrelated with the first and second. We generate the population covariance matrix and estimate from it with equal loadings, and we get a warning that the model is not identified. The covariance matrix has only four unique values, 1, 1.3, 2.0 and 3.0, but we have 28 parameters, so we cannot really estimate them. Again, identification requires that the loadings differ.

So what exactly is the identification problem? In the unmeasured method factor model, the most important problem was that we did not know whether, for example, X6 and X7 correlate because of the method or because they measure constructs B and C, which are correlated. Now we know that X6 and X7 correlate only because of the method, but we still have to ask how. The X6, X7 covariance is basically the product of the two method loadings and the variance of the method factor, and we do not know whether the method affects X6 or X7 more strongly. We know the overall size of the effect, because we know the covariance, but not how it divides between X6 and X7. More generally, the identification problem here is that we do not know whether the method affects the marker indicators X7, X8 and X9 more strongly, or the substantive indicators more strongly. So what can we do when we face this under-identification problem? If we fix the method factor loadings to one here, we are actually solving the right problem, because the problem was that, as a set, we do not know whether the X1 to X6 indicators or the X7 to X9 indicators load more highly on the method factor; we do not know the relative magnitudes of those two sets of loadings. Fixing the method factor loadings to one addresses exactly that identification problem (a sketch of such a specification follows).
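A hypothetical lavaan sketch of the marker variable model with all method factor loadings fixed to one, matching the constraint discussed above; the names and data are illustrative:

```r
# Marker variable model: C is the marker construct, uncorrelated with A and B,
# and the method factor loadings are all fixed to one.
marker_model <- '
  A =~ x1 + x2 + x3
  B =~ x4 + x5 + x6
  C =~ x7 + x8 + x9          # marker construct
  M =~ 1*x1 + 1*x2 + 1*x3 + 1*x4 + 1*x5 + 1*x6 + 1*x7 + 1*x8 + 1*x9
  C ~~ 0*A
  C ~~ 0*B
  M ~~ 0*A
  M ~~ 0*B
  M ~~ 0*C
'
# fit_marker <- cfa(marker_model, data = your_data)  # 'your_data' is a placeholder
```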
So the identification problem is that the relative effect of the method factor on the marker indicators versus the main indicators is not known, and therefore this fix addresses the right problem. Again, there is not much research on this issue: articles on these method factor models mention identification problems, but they are rather vague about them and do not explain in detail what exactly the problem is. The bottom line is that these kinds of models are defensible, and if you suspect empirical under-identification, you can address it by fixing all the method factor loadings to one. A probably better solution is to fix just two loadings to be the same, for example the first loading of a marker indicator and the first loading of the first interesting construct's indicators; that is enough for identification, so we do not have to fix them all (a sketch follows below). Which loadings you fix should be guided by theory: consider which indicators you think are most strongly affected by the method variance, and which of the markers are most strongly affected, and do not fix those. The bottom line is that these models can be useful, but if you constrain loadings to be the same for identification purposes, you should clearly state that you are making an assumption and that the model is probably an approximation of reality. Another strategy for addressing empirical under-identification is to use such a large sample that it is no longer a concern; if you have a sample size of 10,000, the argument that your model is empirically under-identified and you only get results because of sampling error would not be very plausible.

Let's look at another example. This is from Spector (2019), which discusses the use of marker variables and recommends using more than one marker, which is a good idea. They use mood and negative affect, which are not really markers but directly measured causes of method variance. If you think that, for example, interpersonal conflict items and physical symptom items are affected by different sources of method variance, which they probably are, because I would assume people are less willing to report interpersonal conflict, which is not socially desirable, than to report physical symptoms, which do not carry the same kind of social desirability bias, then having multiple different measures that you can apply as markers is always better: it allows you to match the kind of bias you assume is present with a marker that you think is affected by the same source of bias. The model shown here would not be identified as drawn; I contacted the authors, and they clarified that mood and negative affect are directly measured variables, scale scores rather than latent variables, and using scale scores here does identify the model.

So, conclusions on the identification of method factor models. These models are problematic, and the modeling approach is problematic more generally, because a single source of method variance is often unrealistic. You really have to explain why you think a single source is at work: is it only social desirability that affects the items? That would be a stretch if you have, for example, both behavioral items and evaluative items in the same survey. Empirical identification is another problem.
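A hypothetical sketch of the less restrictive constraint: keep x1 as the scaling indicator of the method factor and fix the method loading of the first marker indicator (x7) to the same value, leaving the other method loadings free. Which pair to constrain is a substantive choice; this is only an illustration:

```r
# Marker model where only two method factor loadings are constrained to be
# equal (both fixed to one): the first substantive indicator and the first
# marker indicator. All other method loadings remain free.
marker_model_two <- '
  A =~ x1 + x2 + x3
  B =~ x4 + x5 + x6
  C =~ x7 + x8 + x9
  M =~ x1 + x2 + x3 + x4 + x5 + x6 + 1*x7 + x8 + x9
  C ~~ 0*A
  C ~~ 0*B
  M ~~ 0*A
  M ~~ 0*B
  M ~~ 0*C
'
# cfa() fixes the first method loading (x1) to 1 by default, so x1 and x7 end
# up with equal method loadings while the rest are estimated freely.
# fit_two <- cfa(marker_model_two, data = your_data)
```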
The way to deal with empirical under-identification is to have a very large sample size: if you get a result from a very large sample without warnings, then empirical under-identification is probably not a concern. So how should these models be used? I don't think the unmeasured method factor model should be used unless your sample size is so large that you can rule out identification coming from random noise, and even then it is a very risky strategy to rely on, because you may still have an empirical identification problem. If you plan in advance on using the unmeasured method factor model, it is possible that you end up with a data set whose population is empirically under-identified, and you cannot really do anything about it; you would need to redo the study and collect new data with markers, or with measured sources of method variance, to address the problem. Marker variables are more defensible, but you should not just use a single method factor and add markers to the model; if possible, you should consider a more advanced model and think through what the sources of method variance are, instead of assuming there is just one factor. The article by Spector discusses the choice among different sources of method bias and how you can actually think through your items and evaluate which sources affect them, and the article by Simmering discusses the choice of marker variables; both are essential reading if you want to use markers. Finally, there is the issue of proper interpretation of these models, which I talk about in another video.