In the first two videos on structural equation models, I covered some of the conceptual background, the history and some of the key ideas. In this video we move on to applications, to some of the actual model fitting that goes on in structural equation modelling, and this focuses particularly on confirmatory factor analysis. So in this video I'm going to talk about the general idea of how we measure concepts using latent variables, and I'm going to contrast two approaches to using latent variables to measure concepts. The first is the more conventional, or historically the main, way of doing this: exploratory factor analysis. I'm going to contrast this with the more modern approach of confirmatory factor analysis. I'll then move on to some of the ways we go about actually fitting and estimating confirmatory factor models and some of the important procedures involved. And I'm going to finish off by talking about some of the extensions of CFA: modelling the means of latent variables as well as their relationships, their associations; the difference between formative and reflective indicators in CFA; a procedure called item parcelling; and also the situation, which we may sometimes be interested in, of fitting a factor model to variables which are themselves latent variables rather than to observed variables, which is the usual case. That would be called a higher-order factor model. So in the first video I gave a pithy definition of structural equation modelling as path analysis with latent variables. We can also think of this as really being a distinction between two stages or two parts of the modelling process. The first is where we want to get good measures of our concepts or our constructs, and the second part is looking at the relationships between those measured constructs. 
So, if you like, the emphasis is firstly on measurement, on measurement accuracy and adequacy, and then secondly on the structural relationships between the constructs that we've measured. Again, we saw in the first video that any time we measure something in science, and particularly in social science, the measurements contain various kinds of error. That error can be random and/or systematic. So what we want to do in our statistical approach to the data is to isolate the true score in a variable and remove the error, and this is really what we're trying to do when we use latent variables for measurement. We want to decompose our X variables, where X is what we've actually measured, into a T component and an E component: T is the true score and E is the error. And we need some kind of model to enable us to split X into these T and E components. Now, one quite straightforward and useful way of doing this is simply to add the scores across a number of different X variables. If we have, say, four variables which all measure the same underlying concept, then we could just add those up and take a summed score. This has some benefits, because the random error in each of those measurements will tend to cancel out as we add items together. But it's a rather unsophisticated approach, and in particular it gives equal weight to each item in the construction of the true score, and that's often something we don't want to do. So another approach is to actually estimate some kind of latent variable model. In understanding the way we do this in SEM, it's useful to go back in history, if you like, and think about an earlier approach to estimating latent variables. This isn't to say that exploratory factor analysis is no longer used, of course it is. But the more modern procedure of confirmatory factor analysis has some attractive properties, shall we say, compared to EFA. 
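To make the summed-score idea concrete, here's a minimal sketch in Python. All of the data are simulated and purely illustrative: four noisy items, each equal to the same true score plus independent random error, so that adding the items together cancels much of that error.

```python
import numpy as np

# Simulated data: a true score T plus independent random error E in each item
rng = np.random.default_rng(0)
n = 1000
true_score = rng.normal(0.0, 1.0, n)                          # T
items = np.column_stack(
    [true_score + rng.normal(0.0, 1.0, n) for _ in range(4)]  # X = T + E
)

# Summed score: equal weight for every item; the random errors tend to cancel
summed = items.sum(axis=1)

# The summed score tracks the true score better than any single item does
r_single = np.corrcoef(items[:, 0], true_score)[0, 1]
r_summed = np.corrcoef(summed, true_score)[0, 1]
print(round(r_single, 2), round(r_summed, 2))
```

The sum is closer to the true score than any one item, but every item is weighted equally, which is exactly the limitation noted above.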
So the exploratory factor model is also referred to as the unrestricted factor model, or unrestricted factor analysis, because, as we'll see when we get to looking at CFA, CFA places restrictions on the variance-covariance matrix whereas EFA doesn't. EFA, or the similar technique of principal components analysis, finds the factor loadings which best reproduce the correlations observed between the observed variables in our model. So let's say that we have six questionnaire items that all measure more or less the same thing; they're intended to measure some concept that we're interested in. An EFA will simply reorder the data in a way which best accounts for the observed correlations between those variables. Now, it does this by producing a number of factors which, in EFA, is equal to the number of observed variables that we have. So this is really just a reordering of the observed data: we end up with the same number of factors as we have observed variables. At this point, with just this reordering, EFA hasn't done very much in the way of summarizing or simplifying, which is often what we're trying to do with a latent variable model. We have the same number of factors as we have observed variables, and all the observed variables in our model are allowed to correlate with all of the factors. Now we need to get from this point, of having the same number of factors as observed variables, to retaining a smaller number that does some job of summarizing rather than just transforming the observed relationships. And there are different rules for doing this. One kind of heuristic judgment would be to retain a number of factors, fewer than the number of observed variables, that explains some satisfactory amount of the observed variance. So we might say we'll retain as many factors as are needed to explain 70% of the variability, of the correlations between the observed variables. 
Something else that we have to do, in addition to summarizing, is to understand what the factors produced by the factor analysis mean, what they are measuring. We do this by looking at the pattern of factor loadings between each factor and the observed variables. So we do this in an inductive way: we work out what the factors are by looking at how they are related to the observed variables. Another thing about exploratory factor analysis is that there is no unique solution where we have more than one factor, and so we can rotate the axes of our solution in ways that can help us to see what the underlying structure is. Rotation of axes in exploratory factor analysis is therefore quite common. Now, to give an example of what I mean by some of those previous points, here's some made-up data. We have nine observed items, knowledge quiz items that have been administered to a sample of children, and what we're measuring is some construct like intelligence or cognitive ability. If we were to apply an EFA or a principal components analysis to these data, then we would initially have nine components or factors, the same number as the observed items. So the first thing we would need to do would be to apply some judgment about how many factors to retain. In this case you can see that three factors have been retained in this model, and that may have been based on one of those heuristic guides around the amount of variance explained, or on some kind of plot, like a scree plot. Once we've done that, we want to know what each of these three factors is actually measuring. And we do that by looking at the pattern of correlations, which is what's in the rows and columns of this table, between each factor and the set of items. So if we look first at factor one, we can see that the factor loadings, the correlations, are high between factor one and the observed items which measure mathematical ability. 
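The variance-explained retention heuristic can be sketched numerically. This uses a made-up correlation matrix for six items (not the nine-item quiz example), with two clear clusters of three items each; the eigenvalues of the correlation matrix give the variance accounted for by each component.

```python
import numpy as np

# Hypothetical correlation matrix for six items: two clusters of three,
# with high within-cluster (0.7) and low between-cluster (0.1) correlations
R = np.array([
    [1.0, 0.7, 0.7, 0.1, 0.1, 0.1],
    [0.7, 1.0, 0.7, 0.1, 0.1, 0.1],
    [0.7, 0.7, 1.0, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.7, 0.7],
    [0.1, 0.1, 0.1, 0.7, 1.0, 0.7],
    [0.1, 0.1, 0.1, 0.7, 0.7, 1.0],
])

# Principal components: eigenvalues = variance explained by each component,
# sorted from largest to smallest; they sum to the number of items
eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]
prop_explained = np.cumsum(eigvals) / eigvals.sum()

# Heuristic retention rule: keep as many factors as needed to reach 70%
n_retain = int(np.argmax(prop_explained >= 0.70)) + 1
print(n_retain)  # → 2
```

With this particular matrix, two components explain 80% of the variance, so the 70% rule retains two. The subjectivity the lecture mentions is visible here: a 90% threshold would retain more.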
So this is saying that the higher your score on factor one, the more likely you are to get the item math one correct: there's a high correlation between your score on the factor and your score on the item. For factor two there are high loadings on the visual-spatial items and low loadings on the other items. And for factor three we see the other pattern, where it's the verbal items that have high loadings and the other items have low loadings. So we go through this inductive process of figuring out what the factors are measuring by looking at the correlations between the factors and the observed variables, once we've retained a smaller number of factors that we think is in some way satisfactory. This is a very useful procedure and has been widely used in social science for many decades, but it does have some limitations. Firstly, EFA is an inductive, rather atheoretical procedure, and that is something which, in general, we are less happy with in terms of the way we build theory in quantitative social science. We've got a situation where the data is telling us what our theory should be, when generally we would prefer to do it the other way round: we would have a theory and test it against the data. Another unattractive property of EFA and similar techniques is that they rely on subjective judgment and heuristic rules about what counts as a large amount of variability to explain, and so on. So there's a lot of room for subjectivity in determining what our model should be. And of course, when we are analysing data of this nature, where we have indicators of underlying concepts, it's rarely the case that we have no theory at all about which concepts the different indicators are actually measuring. We've usually written the questionnaire with the specific intention of measuring particular concepts. 
So actually the more realistic and accurate assessment of what's going on here is that we're starting with a theory and then assessing it against the data that we've collected. The idea that we are going from the data to the theory is not generally an accurate representation of how this procedure actually works. Given that that is the case, given that we do have a theory about how the indicators are related to the concepts, it's better to be explicit about that from the outset and then use statistical tests of those theories of measurement against the sample data that we've collected. So we can compare this approach of exploratory factor analysis with a confirmatory approach. Confirmatory factor analysis is also referred to as the restricted factor model because, unlike EFA, it places restrictions on the parameters of the model. It can't therefore be rotated: you can't rotate the solution, there is only one unique solution for the CFA. And the key difference between CFA and EFA is that we specify our measurement model before we've looked at our data; this is sometimes referred to as the no-peeking rule. If we have a theory about how the indicators are related to our concepts, then we should set that down a priori as our theory and then test it against the data, rather than tweaking our theory as a function of the particular sample data that we happen to have. So when we do things in this confirmatory way, the key questions that we have to answer are: which indicators measure, or are caused by, which factors? And importantly, and this is the real distinction with EFA, which indicators are unrelated to which factors? Remember, in an EFA every variable is allowed to correlate with every factor. In CFA that isn't the case: we will say that the correlations or the covariances between some of the indicators and some of the factors are zero. 
We will impose that as a parameter restriction. And we will also need to answer questions about the correlations between the factors, rather than leaving that as a default assumption in the model. Here we have six observed variables, X1 to X6. In the EFA, the first part of the procedure will have produced six factors or components, and at this stage we've already retained just the two factors that we think explain enough of the variability between our observed variables. But what you also see here is that there is still a single-headed arrow running from each of the two latent variables to all six of the observed variables. So we are estimating a loading between each factor and each of the observed variables. Now, what we would be looking for in this kind of situation is that some of those loadings would be large and some of them would be close to zero. So if we look at eta 1, for example, we might in an EFA context expect that the loadings between eta 1 and X1 to X3 would be high, say 0.7 or above in standardised form, and that the loadings from eta 1 to X4 to X6 would be close to zero. And the opposite would apply for eta 2. So what we're doing there is estimating all of those relationships and expecting some pattern of high and low loadings among them. By way of contrast, take the same variables and the same two factors, now in the form of a confirmatory factor model. Rather than having estimates for all of those relationships between eta 1 and X1 to X6, and eta 2 and X1 to X6, we say that there is no relationship between eta 1 and X4 to X6: there's no arrow pointing from eta 1 to any of those observed variables. And the same goes for eta 2 and X1 to X3: no arrows point from eta 2 to those variables. The fact that there isn't an arrow there means that in our model we are constraining those loadings to zero. We're not just estimating them and asking whether they are nearly zero; we are specifying our model a priori to say that those paths are indeed zero. 
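The zero restrictions can be seen directly in the loading matrix. The sketch below is purely illustrative (all parameter values are invented): it builds a restricted two-factor loading matrix for X1 to X6 and computes the model-implied covariance matrix from the standard factor-model decomposition.

```python
import numpy as np

# Loading matrix Lambda for six indicators and two factors (eta1, eta2).
# The zeros are the a-priori restrictions: eta1 does not load on X4-X6
# and eta2 does not load on X1-X3. Non-zero values are hypothetical.
Lambda = np.array([
    [1.0, 0.0],   # X1: reference item for eta1 (loading fixed to 1)
    [0.8, 0.0],   # X2
    [0.9, 0.0],   # X3
    [0.0, 1.0],   # X4: reference item for eta2
    [0.0, 0.7],   # X5
    [0.0, 0.9],   # X6
])

Phi = np.array([[1.0, 0.4],    # factor variances and their covariance
                [0.4, 1.0]])
Theta = np.diag([0.5, 0.6, 0.4, 0.5, 0.7, 0.4])  # unique (error) variances

# Model-implied covariance matrix: Sigma = Lambda @ Phi @ Lambda' + Theta.
# Estimation chooses the free parameters so that Sigma reproduces the
# sample variance-covariance matrix S as closely as possible.
Sigma = Lambda @ Phi @ Lambda.T + Theta
print(Sigma.round(2))
```

Because the zeros in Lambda are fixed rather than estimated, the model is over-identified, which is what makes the comparison between Sigma and S a genuine test of the measurement theory.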
So those are the kinds of parameter constraints and restrictions that I was referring to in video 2. It's quite unusual, in the other branches of statistics that we use in social science, to make these constraints and fix parameters to particular values. But that's why we call the confirmatory model the restricted factor model: because we place restrictions on the loadings. So sometimes, as in the example I just gave, we fix particular parameters to zero, for the paths from factors to indicators that they do not measure, do not influence. And the important thing to understand is that our theory of the measurement of our concepts, how we think the concepts are related to the indicators that we've selected and written, if they're questionnaire items, is expressed in the constraints that we place on the model. So we're not just estimating everything; we are placing restrictions on the values that parameters can take. And those restrictions, those fixings of parameters, over-identify the model. We are placing restrictions which give us more degrees of freedom in our model, which in turn enable us to test the fit of our model against the matrix that we've actually observed, S, the sample variance-covariance matrix. Another way that we apply restrictions to the parameters in a confirmatory factor model is to give the latent variables a metric. Now, what I mean by that is that if we have a measured variable, we will have specified some kind of scale for respondents to answer on. Maybe strongly agree is the value one and strongly disagree is the value five, so the scale is one to five for that measured variable. A latent variable doesn't have any metric. It is an unobserved variable, a hypothetical variable, so it doesn't have a metric of its own. We have to give it one, and there are two ways that this can be done. 
The first is essentially to produce a standardized solution, so that all variables are measured in standard deviation units. This can be done by constraining the variance of the latent variable to one. This has some benefits, but the downside, of course, is that we no longer have an unstandardized solution: if we require all latent variables to be measured in standard deviation units, then they retain nothing of the unstandardized metric they could have been given. The second approach is to constrain one of the factor loadings to take the value one. By doing this, we take the scale from that particular item, which we call the reference item. So if we fix the factor loading of a particular item to one, then that will be the reference item, and the latent variable will have the same scale as that item. If it's measured on a one-to-five scale of strongly agree to strongly disagree, then the latent variable will be on a scale of one to five; if it's a one-to-ten scale, the latent variable will be on that same scale. Now, this is generally preferred to the first approach of having a fully standardized solution, because using the second approach, fixing one loading to the value one, we can also obtain the standardized solution. So in confirmatory factor analysis, we are interested in making good measures of the key constructs, the concepts, in our theories. In the next stage we usually move on to look at the relationships between the measured concepts, and so conventional SEM is focused on the structural model, the relationships between concepts. We are not so interested in the means of the observed or the latent variables: in the conventional way of doing SEM, as I say, that isn't a focus. The focus is on covariances and correlations, the relationships between the variables. 
But there are occasions within an SEM context where we would be interested in the means of latent variables. There are two main situations where we would want to estimate latent means. The first is where we want to see whether there are differences between groups on a latent variable. The second is where we're interested in change over time: if we've got a longitudinal data set, we would want to estimate the mean of the latent variable and see whether it is changing over time. So when we introduce means into our CFA, we do this by adding a constant to the model. Actually, when you fit models in modern SEM software, this isn't a choice that the analyst has to make; it is, if you like, done under the hood. But the process that is actually implemented is to add a constant which has the same value, one, for all cases in the model. Now, the regression of a variable on a predictor and a constant gives us the mean of that variable via the unstandardized coefficients of that regression. And the mean of an observed variable is the total effect of the constant on that variable. The total effect, as we saw in video one, is the sum of the indirect and the direct effects. So if we now introduce a constant, which in path-diagrammatic notation is represented as a triangle, with the number one inside the triangle to indicate that the constant is one, then in this path diagram we again have a y variable and an x variable. We have a direct effect from the constant to y, which has the coefficient a; a direct effect from the constant to x, which is b; and a direct effect from x to y, which is c. So the indirect effect of the constant on y is the product of b and c. By adding in this constant, we can estimate the mean of x, which is simply the coefficient b, and we can estimate the mean of y by taking the sum of a and the product of b and c. That's the total effect, the sum of the direct and the indirect effects. 
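The arithmetic of the constant can be sketched with hypothetical coefficient values for the a, b and c paths just described:

```python
# Means via a constant (the triangle with 1 inside it in the path diagram).
# The coefficient values are invented, purely for illustration.
a = 2.0   # direct effect of the constant on y
b = 3.0   # direct effect of the constant on x
c = 0.5   # direct effect of x on y

mean_x = b            # the mean of x is simply the coefficient b
mean_y = a + b * c    # total effect on y: direct (a) plus indirect (b * c)
print(mean_x, mean_y)  # → 3.0 3.5
```

So with these made-up values, the mean of x is 3.0 and the mean of y is 2.0 + 3.0 × 0.5 = 3.5, the sum of the direct and indirect effects of the constant.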
So that's how we introduce means into our model. Now, if we've added a mean structure, then we will require some additional identification restrictions, because we're now trying to estimate more unknown parameters: the latent means. There is a question, then, about how we estimate and compare one mean to another, and the way we do this is by having multiple groups. Where we have more than one group in our sample, we can fix the mean of a latent variable in one of those groups to zero, and then the means of the remaining groups on that latent variable are estimated as differences from the reference group. So with mean models in CFA, one of the groups always has to have the restriction that its mean value is zero, and the other groups are interpreted in terms of differences from that reference group. When we've looked at path diagrams and thought about the relationship between concepts and indicators, between latent variables and observed variables, the arrow points from the latent variable to the observed indicator. What this is saying in theoretical terms is that the latent variable causes the indicators; that's why the arrow points in that direction. We can think of that as meaning, if we're trying to measure, let's say, someone's social capital and we've asked lots of questions in a questionnaire, that what's actually causing their answers to those questions is their underlying level of social capital. So the causal arrow points from the latent variable to the observed indicators. Now, for many concepts that direction of causality makes sense. In other contexts, the idea that causality flows from the latent variable to the indicator doesn't really make sense. So let's think of an example where we want to measure socioeconomic status, and we're going to use indicators of someone's level of education, what kind of occupation they have, their earnings and so on. 
And we want to combine these somehow into a latent variable that measures their socioeconomic status. What's problematic about this, in the reflective-indicators framework, is that it doesn't really make sense to say that I have some underlying socioeconomic status and that, if it were to change, then my educational level would change, or my earnings would change, or my occupation would change. Actually, causality is flowing in the other direction, if there is any causality going on here at all: someone's level of education influences their socioeconomic status, as do their earnings. So now we're in a situation where it makes more sense for the causality to flow from the indicator to the latent variable. The key question here is: if we could somehow change someone's score on the latent variable, would it make sense for their scores on the observed indicators to change? For some concepts that makes sense; for others it doesn't. In the cases where it doesn't make sense, we essentially turn the arrows round and make them point from the indicators to the latent variable. In this context we've now got what we call formative indicators rather than reflective indicators. Now, as I said, it's a different sort of latent variable that we're now dealing with. It's essentially a weighted index of the observed indicators, and it doesn't have a disturbance term; there's no error in it. So it's not the same kind of variable as we would have with reflective indicators. The key thing is that in the path diagram the arrows point from the indicators to the latent variable rather than the other way round. There are, of course, some quite different procedures for estimating this kind of model, but for now the concern is to understand the conceptual difference and the fact that the indicators are related differently to the latent variables. 
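A formative latent variable, then, is just a weighted index of its indicators. A minimal sketch, with invented weights and made-up z-scored indicator values for one respondent:

```python
import numpy as np

# Hypothetical standardized indicator scores for one respondent:
# education, occupation and earnings, each as a z-score
indicators = np.array([0.8, 1.2, -0.3])

# Hypothetical weights (in a real analysis these are estimated)
weights = np.array([0.5, 0.3, 0.2])

# A formative latent variable is a weighted index of its indicators,
# with no disturbance term: it is fully determined by the indicators
ses = weights @ indicators
print(round(ses, 2))  # → 0.7
```

Note the contrast with a reflective factor: here nothing is left unexplained, because causality runs from the indicators into the index rather than from a factor out to the indicators.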
Another common procedure in confirmatory factor analysis arises when a researcher has a very large number of indicators for a latent construct, or for a number of latent constructs. This is quite often the case in psychology, where there are quite complex latent variables and each one may have 10, 12 or more indicators. One of the problems researchers run into with this kind of data is that the model can become extremely complex very quickly, and there are lots of difficulties people can run into with estimation, interpretation and so on, simply because there are so many relationships in the observed data when there is such a large number of indicators and latent variables. This is sometimes combined with quite small sample sizes, which can add to the problem. In this situation, researchers will sometimes use an approach called item parcelling, which has a first stage of adding up the scores for subgroups of those items; those subgroups of parcelled items, those sum scores, then act as the observed indicators for the latent variables. This is a parsimonious way of treating rather complex data. It does rely on some assumptions about the unidimensionality of the items in each parcel, but it is an approach that researchers in that context, with lots of indicators for their latent variables and large numbers of latent variables, can pursue. Lastly, I'm going to talk about a kind of confirmatory factor model where the latent variables are not measured by observed indicators but are themselves measured by latent variables. So we have a sort of hierarchical structure: a first set of latent variables is measured using observed indicators, because we have to have observed indicators at some point in the model, and once that first set of latent variables is measured, a higher-order factor can be added which is a function of the first-stage latent variables. 
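Returning briefly to item parcelling: the first-stage step described above can be sketched as follows. The responses are simulated and the parcel assignment is arbitrary, purely to show the mechanics of collapsing twelve items into three parcel scores.

```python
import numpy as np

# Simulated responses: 200 respondents, 12 Likert-type items (1-5)
# all intended to measure one construct
rng = np.random.default_rng(1)
items = rng.integers(1, 6, size=(200, 12)).astype(float)

# Item parcelling: average pre-defined subgroups of items, then use the
# three parcels, not the twelve items, as the observed indicators in the CFA.
# This assumes the items within each parcel are unidimensional.
groups = [slice(0, 4), slice(4, 8), slice(8, 12)]
parcels = np.column_stack([items[:, g].mean(axis=1) for g in groups])
print(parcels.shape)  # → (200, 3)
```

The subsequent CFA is then fitted to the three parcel columns, which keeps the model far more parsimonious than one with twelve loadings per factor.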
Now, this is an approach which is often useful when our theories are not so much about the relationships between variables but about the dimensional structure of the data. In psychology there are debates about the number of personality dimensions, and often with belief systems and so on it's important to understand how many different dimensions there are, in addition to how those dimensions might be related to other variables. So for intelligence, personality and so on, higher-order factor models can be useful. They can also be applied in a longitudinal context. Here's what a path diagram for a confirmatory factor model with a higher-order structure would look like. At the bottom of the diagram we now have the observed variables, in rectangles. There are nine of those, and each set of three measures a latent variable; the highest-level variable, eta 1, is then measured as a function of those three latent variables. So in this third video I've looked at some of the important issues in confirmatory factor analysis. I started off by looking at the general idea of using latent variables to measure the concepts in our theories. I've contrasted the historical, conventional approach of exploratory factor analysis, the unrestricted factor model, with the more modern confirmatory factor model, the restricted factor model. We've looked at how we can give a metric, a scale, to latent variables by fixing one of the indicators' loadings to take the value one and thereby taking the scale from that reference item. We've thought about how we can analyse means within a confirmatory factor model: usually we're mainly focused on associations, on correlations, but we can also estimate means. 
We've looked at some special cases: where we have formative indicators rather than reflective indicators; where we have a first stage of item parcelling when there are many, many indicators and a large number of latent variables; and we finished with the special case of a higher-order factor model, where a latent variable is measured not by observed items but by lower-level latent variables.