It's challenging to get people totally psyched up to think about statistics for two hours, and I'm very glad that you agreed to do so. I'd like to introduce Dr. Sam Wilford, who is a consultant to Apped Associates, which is our statistical consulting company. Sam has worked with both Benz and me on the validity and reliability of new scales that we've developed to do a variety of different things. And he's here today in particular. This is the second workshop he's done. He came two years ago. No, it was way longer than that. At least six years. And he taught structural equation modeling, which I can still say today I can't do, but I understand, which is really a nice feeling. So this time, a research team I was working with, and some of them may be on the phone, asked me, what's the difference between an exploratory factor analysis and a principal components analysis? And I answered boldly, like I knew. And then we started looking it up, because people weren't sure that was actually a distinction. And in fact, we found discrepancies everywhere we looked. People were saying you only use one in one case and the other in another. Then it seemed to be a disciplinary difference. And then that fell apart. And all of a sudden, we realized they were described as different things, but we had no idea really what the argument or logic was about when you would choose to use one over the other. So it was a very humiliating experience. And we decided we needed an expert to set us straight. So that's the history of how Sam got here. And we're really glad you're here. And we desperately need you. Well, thank you. Thank you. Might be nice. Can we just go around the room? You can. I promise I probably won't remember your names, but it's just nice to have that familiarity while we're together. Hi, I'm Joelle. Joelle? I'm Ruka. Ruka? I'm Laurie. OK. Sophia? Janine. Janine. Melissa? Eija? Benz? Shereen. And we've got? William. Jeff. 
OK. I do this with my students every semester. It takes me at least four or five weeks to remember all the names and faces, but it's always a nice thing to at least be able to talk to people. So what I did was take some of my own lectures and pull them together to try and answer this question, which I transformed into a title for those of you who are, as I look around, maybe old enough to remember the Billy Crystal movie called Analyze This. I'm taking a little license on that one. At any rate, there is a lot of confusion. I've been teaching this now for at least the last 10 years, and this is a very hard distinction for most students to understand because it's a little bit nuanced. And there's enough similarity in what goes on when you do a principal components analysis versus an exploratory factor analysis that we lose track of what the difference is. And maybe the true difference never really sets in until you start using it, and then you forget what those differences are. So let's make this as informal as possible. I don't know exactly how long this is going to take. But what I've done is I'm calling it an overview, but what I'm really going to do is walk through a lot of the components of factor analysis that are common to both principal components and exploratory factor analysis. Now this is going to get even more confusing, because another common term for exploratory factor analysis is common factor analysis. But that term doesn't apply to principal components. So we'll try and keep that straight if you get confused. Again, stop me. Let's talk about it, whatever. But ask questions as we go through the material so that you make sure you get them answered. I'm going to cover the similarities first. And then I'm going to look at some of the specifics for each of principal components and exploratory factor analysis. 
And then I've brought an example that's a little bit, it's not totally contrived because the data's real, but it's contrived in the sense that you probably wouldn't use both of these methods on it. But it provides a great example of what happens if you use the different methodologies, and you can really see how things are different, which hopefully will resonate a little bit better than if we tried something a little bit closer to your world, okay? And then I'll summarize some of the similarities and differences, and then if we have time, which I'm not sure we will, we can touch a little bit on confirmatory factor analysis, which is related in a way, but that's kind of a separate discussion, okay? All right, so just to get started, let's take a test, okay? Again, don't feel embarrassed. I'm gonna give you some situations and have you indicate whether you think each is a situation for principal components or a situation where you'd use exploratory factor analysis, okay? So if we wanted to reduce a large number of variables to a smaller number of factors that we were gonna use in analysis later, how many people think that's principal components, okay? How many think that's exploratory factor analysis? Oh, come on, everybody has to vote now. Okay. How about reallocating the variation in a large number of variables? How many would think that was principal components analysis? How many would think that's exploratory factor analysis, okay? Who's voting? We're getting some hands now. Everybody's sneaking one up, you know? They don't want to show anybody else, but they're letting me see it, okay? How about creating an orthogonal representation of the original variables? Principal components analysis, okay? Exploratory factor analysis, okay? And that can solve problems of multicollinearity. Identifying underlying dimensions in the data, such as constructs: principal components analysis, okay? 
Exploratory factor analysis, okay? And the rest of you are not making a commitment here one way or the other, okay? Regression with many correlated variables: principal components analysis, okay? Exploratory factor analysis, okay? Creating a hypothesis for a confirmatory factor analysis: principal components analysis, okay? Exploratory factor analysis, okay? That's probably the one we got closest to getting correct. Okay, so that's good, because clearly there's some difference of opinion here, and that always makes for a more interesting discussion. So: both of them work to reduce a large number of variables to a smaller number of factors. So that could be principal components or exploratory factor analysis; it doesn't really differentiate. Reallocating the variation in a large number of variables, that is much more oriented toward principal components analysis, okay? And we'll talk about why. Creating orthogonal representations of the original variables. Well, without the subheading there of solving problems of multicollinearity, actually both of them do that, depending on the rotation that you might employ. But really, for principal components analysis, that's part of the orientation of the basic algorithm. So it's a little bit more toward principal components, but it also tends to happen with exploratory factor analysis, okay? Where are we? Underlying dimensions of the data, constructs: most of you said exploratory factor analysis, and that is probably why you would be driven toward exploratory factor analysis. Regression with many correlated variables, that would be principal components analysis, okay? Again, regression is a variance-based methodology. You're trying to capture as much of the variation as possible, and that's really what principal components does. And lastly, creating a hypothesis for a confirmatory factor analysis. Again, that's much more oriented toward exploratory factor analysis. 
Probably wouldn't use principal components for that, okay? But these are the questions that trip us up. These are the situations. So let's try and figure out why this would be the case. And we should say that factor analysis in general, both principal components and exploratory factor analysis, is an exploratory technique. It's not confirmatory; there's no significance testing associated with it. It is exploratory the same way descriptive statistics are. So what you get at the end is not, okay, this is the answer. What you get is something that you feel confident that you can interpret, okay? So there's no right or wrong here. But ideally what you're trying to do is reduce the dimensionality of your data. So you're trying to take large numbers of variables and reduce them to some smaller set that you can then work with, okay? All right, so overview. These are some of the key elements that are common to both principal components and exploratory factor analysis. So you're always gonna start off with some number of variables. These are the observed variables that we get, whether it's from a survey or however we generated them. Now, if there's no correlation between these variables, there's no reason to do any kind of factor analysis. So there has to be some sort of inherent correlation, and typically each variable that you're using in some sort of factor analysis should have, and these are rules of thumb, there's no theory behind this, a correlation of at least 0.3 with at least one other variable, okay? Otherwise, that variable is probably gonna drop out of the analysis somewhere along the way because it's not gonna factor well, okay? So we're starting off with some group of correlated variables. Ideally, we'd like to have some sort of conceptual framework. 
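That 0.3 rule of thumb is easy to check yourself before running anything. Here's a minimal sketch with numpy and made-up data; the function name and the toy variables are mine for illustration, not part of any package:

```python
import numpy as np

def weakly_correlated_vars(X, threshold=0.3):
    """Flag variables whose largest absolute correlation with any
    other variable falls below the rule-of-thumb threshold."""
    R = np.corrcoef(X, rowvar=False)   # p x p correlation matrix
    np.fill_diagonal(R, 0.0)           # ignore each variable's self-correlation
    max_r = np.abs(R).max(axis=0)      # best correlation per variable
    return [j for j in range(R.shape[1]) if max_r[j] < threshold]

# Toy data: x0 and x1 share variance; x2 is pure noise
rng = np.random.default_rng(0)
x0 = rng.normal(size=500)
x1 = x0 + rng.normal(scale=0.8, size=500)
x2 = rng.normal(size=500)
X = np.column_stack([x0, x1, x2])
print(weakly_correlated_vars(X))       # x2 should be flagged to drop
```

Variables this flags are the ones the speaker says are likely to drop out of the analysis anyway.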
It is possible to go into a factor analysis saying, I really have no idea what variables should hang together here or what they should represent. If you're doing that, there's a good chance you're gonna end up with something that is not justifiable, not supportable. When you come out of a factor analysis, you're gonna come out with something that you feel is interpretable, and you want it to be something that you feel there's some justification for besides the data, okay? So you wanna be able to point to some literature, point somewhere and say, you know, this is what the literature says should have happened, or this is what some other study found and we're just adding a little bit more to it, et cetera. So you have something to hang your results on besides just, I think this sounds good, so I like this result, okay? But you don't always have that. Obviously, you also wanna think through some other issues that could impact the results. So if you have other extraneous variables, such as gender or some other kind of criterion, they might affect your data and might affect the factor analysis. So the factor analysis might be different for males and females. Well, if you lump them together, you may not get a definitive factor analysis. It's kind of like dealing with a bimodal distribution and saying, let me take the mean. Well, the mean doesn't represent either of the populations that created the two lumps in the data. So you end up somewhere that's not really part of either set, and the same thing can happen with factor analysis. So you do have to think through: what are some of the other variables that could impact the results I'm gonna get, and do I need enough data to be able to run them separately so I can tell whether that's a problem or not, okay? That's true again of either principal components or exploratory factor analysis. Sample size, this is always an interesting topic. Again, it's not theoretically determined. 
These are rules of thumb, okay, from people who have done lots of these. Typically, if you're doing a factor analysis, you want something like five to 10 observations per variable at a minimum. These are all minimums, not maximums, okay? Why is that? Well, you're gonna come up with some sort of loading for each variable on a factor. So the fewer observations you have in your dataset, both overall and per variable, the less stable that loading's gonna be. You'll still get a loading. It'll still run. You'll still get an analysis, but you have to ask yourself, how reliable is that result? So these are all issues with the reliability of whatever analysis you're running here. I usually use about the same for regression. So if you're doing regression analysis, I usually use the 10 number, not the five, but I've seen five to 10. Typically you want something like 10 observations per independent variable. Now if you have lots of independent variables and not a lot of data, you start to run thin pretty quick. I've seen limits that say you wanna have at least 50 observations overall, but I've also seen ones that say you shouldn't run a factor analysis with less than 200. So there's not a lot of consensus here, but again, the sample size affects the reliability of the model that you're creating. So the observations per variable affect the reliability of the parameters that you're estimating, and the total number of observations affects the reliability of the overall model that you end up with. And then lastly, there's some suggestion that you wanna have somewhere between two and five variables per factor that you have in mind. So if you have an underlying idea, a conceptual foundation, that you're looking for somewhere between three and five factors, then you might wanna be thinking, I want enough variables for each of those factors so that I get a stable factor measurement. 
Too few, you may not be capturing the essence, the true nature, of the underlying factor. Too many, you can always throw some out, okay? So these are just, yeah. You said for the 200, you've heard that you shouldn't run a factor analysis if you don't have at least 200? I've seen suggestions that you wanna have at least 200. I'm giving you the range. Again, it's gonna depend how many variables you have and how many factors you're using, et cetera. That impacts the number as well. But it's a little bit like power, if you're familiar with statistical power. If you really wanna be able to identify a pattern, if there is one, then you wanna have enough data to be able to get it accurately. Since we aren't doing any statistical testing, you don't have the same requirements for sample size that you would if you were gonna do a t-test or something like that, where you say, okay, I need at least 30 to assume normality or something along those lines. So it's a little bit more variable. Any questions? All right, oh, I just went backward, I think, right. All right, so this is a little bit of a technical issue which you probably don't think about a whole lot, because you just hit the button and run your factor analysis. But underlying the factor analysis, the analysis uses this data in a certain way, and there are some impacts. You have choices. You can run it on a correlation matrix or you can run it on a covariance matrix. A covariance matrix just means the data is centered; it's not normalized, so the variables don't have standard deviation one. A correlation matrix means that they do, okay? Now, this choice has an impact on both methods. If you use a correlation matrix, all your variables have essentially the same starting weight, okay? Because they've all been standardized. 
If you use a covariance matrix, variables that have more variance associated with them are gonna have more weight in the factor analysis. So they're going to orient the results toward those variables that have higher variation, and you're gonna see different loadings. You do have this option. And it is something that we will come back to when we look at the differences between principal components and exploratory factor analysis, because this is where the change is made to run the algorithms. So if you're using SPSS, well, we'll talk about it when we get there. There are ways around it. Okay, so I just wanna point out here that you need to at least think, when you go to run either principal components or exploratory factor analysis: do I want correlation matrix-based or covariance matrix-based, okay? And you can always run them both ways and see what happens with the results, okay? Yes. Variables that don't have a lot of variation, oftentimes you're not all that interested in. And so you don't want to make them have the same weight as a variable that does have a lot of variation, because variation is information, okay? Think of it this way. If I have a variable and every observation is the same, it doesn't really tell you anything, right? Because you only need one observation and you know everything. But a variable that takes on a range of values has more variability. You're gonna learn more about, you know, how does it change, what's the relationship to other variables, et cetera. So there's more information in that variable than there is in one that has very small variation. As soon as you go to a correlation matrix, you've eliminated that difference in information. Now they all have the same. Why would you do it? Well, in most of the studies that I've worked on with folks here, you're giving a survey, okay? And you've got it planned out on a one to 10 scale. 
And you don't necessarily get that, and Barb can attest to this, because I always tell her which variables don't have a lot of range to them, and oftentimes we kick those out of the analysis. But typically you're expecting respondents to answer across the range. And the range is limited because you've defined your scale in a limited way. And so you don't have a difference in range on those kinds of responses. Where you're more likely to see it would be if you're taking blood pressure and heart rate, okay? Blood pressure has a wider scale, right? So it has more variability associated with it. So if you were looking for a factor that was more oriented toward blood pressure: if you standardize it along with heart rate, then they're gonna be treated the same and have the same influence on the factor. Versus if you work with their covariances, then it's more likely that blood pressure is gonna separate to some extent from heart rate. Does that help? Yeah. Okay, all right. So another area of consistency across these is how do you measure whether your results are reasonable? Now, I didn't say whether they're significant, because we can never test that, okay? But we can at least talk about, do they satisfy some general characteristics that we would expect for data that would be useful to factor. So there are a couple of measures that are commonly used. One is Bartlett's test of sphericity. Most of your outputs will come up with this. How many people are familiar with this? Okay, so Bartlett's test of sphericity is a significance test that looks at the correlation matrix and asks, are there any nonzero off-diagonal elements? So you wanna fail this test, because it's basically asking, is your correlation matrix an identity matrix? If it is an identity matrix, right, then there's no correlation, so there's no factor analysis. So the null hypothesis here is that the correlation matrix is an identity matrix; the alternative is that it's not. 
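The blood pressure and heart rate point can be made numerically. This is a toy sketch with hypothetical vitals data, showing how the leading principal axis shifts depending on whether you start from the covariance matrix or the correlation matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical vitals: systolic BP (wide range) vs. heart rate (narrow range)
bp = rng.normal(120, 15, size=1000)
hr = rng.normal(70, 3, size=1000)
X = np.column_stack([bp, hr])

def first_component(M):
    """Leading eigenvector of a symmetric matrix (first principal axis)."""
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, np.argmax(vals)]

cov_pc1 = first_component(np.cov(X, rowvar=False))       # covariance-based
cor_pc1 = first_component(np.corrcoef(X, rowvar=False))  # correlation-based

print(np.round(np.abs(cov_pc1), 2))  # BP dominates the first component
print(np.round(np.abs(cor_pc1), 2))  # near-equal weights after standardizing
```

On the covariance matrix, blood pressure's larger variance pulls the first component almost entirely onto it; on the correlation matrix, both variables get essentially equal weight.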
So you wanna fail this, because you want a matrix that has covariance in it, or correlations in it. That's one of the first things we said we wanna have in our variables, right? So if you've done this visually, if you've looked at the correlation matrix and you've checked that all the variables have a correlation of at least 0.3 with at least one other variable, you should fail this test without any problem. So this is just a double check. Yes? Go ahead. You wanna reject the null. Yeah. Right. Yes. Okay. You wanna reject. Okay, let's be careful about my language. All right. There are some other measures that are also useful. There's something called the measure of sampling adequacy. How many are familiar with this one? Okay, good. So there are two places where you wanna look for this. There's an overall measure, often called the KMO, which stands for Kaiser-Meyer-Olkin, three guys' names; they took the first initial of each of them. At any rate, there is an overall measure which says, as a data set, how well is your data set oriented toward factoring, okay? Again, rules of thumb. These are not statistical tests. The rule of thumb is that if you're at least 0.6, it's adequate. Ideally, you'd like it to be higher, okay? It won't go above one, so it's scaled kind of like a correlation, okay? But there are also measures for individual variables. Yes, for the KMO statistic? It might mean that you've got a couple of variables that are really highly correlated. I'm not sure. I don't think I've seen one that's above 0.9. Theoretically, what would that be? Yeah, I don't know. Okay. Each variable has an associated sampling adequacy measure. And again, you want to look at individual variables the same way, but you're going to run into problems typically if you see any of those having a value less than 0.5. And what is typically recommended is that you take out the lowest one. 
The one that has the lowest value, just take that variable out. It's not going to work. You're going to get low communalities, you're going to get low loadings; it's just not going to fit anywhere, and it's probably going to screw up some other things. Take it out, rerun this part, and see if there are any more less than 0.5. If there are, take the next lowest one out. So you work at this one at a time. If you see four of them less than 0.5, do not take out all four, because you may be throwing away good data. Every time you take one out, all these measures get recalculated, and ones that had been lower than 0.5 may rise above 0.5. So it's an iterative exercise. These are just measures that you always want to look at and check to say, do I see any problems? Before you ever get into rotations or extractions or any of that stuff. And it doesn't matter whether it's PCA or exploratory factor analysis. So again, when you do one of these analyses, typically you want to think about the factors and which ones you want to keep. That's what we mean by factor extraction. You have this issue in both. So the first question is how many factors do you keep, right? And there are typically three ways. Well, let me step back and do one thing before we talk about that. There's also an issue of variability and how you're measuring that. So when you think about correlation, correlation is shared variance, right? Think about a simple regression, one independent variable. If you run that on standardized variables, what's the coefficient that you get? What does that represent? Anybody know? It's the correlation between the independent variable and the dependent variable. So your beta, your slope, right? If you use standardized variables, there's no intercept, so all you get is a slope. That slope is just the correlation between X and Y, right? So that correlation is what we call shared variance. 
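Both checks are simple enough to compute directly from the correlation matrix. The sketch below implements the standard formulas for Bartlett's test of sphericity and the KMO measures using numpy and scipy; the one-factor toy data is made up for illustration:

```python
import numpy as np
from scipy import stats

def bartlett_sphericity(X):
    """Bartlett's test that the correlation matrix is an identity matrix.
    Rejecting the null (small p) means there is correlation worth factoring."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    chi2 = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    df = p * (p - 1) / 2
    return chi2, stats.chi2.sf(chi2, df)

def kmo(X):
    """Kaiser-Meyer-Olkin: overall MSA plus one value per variable.
    Compares ordinary correlations to partial correlations."""
    R = np.corrcoef(X, rowvar=False)
    Rinv = np.linalg.inv(R)
    d = np.sqrt(np.outer(np.diag(Rinv), np.diag(Rinv)))
    P = -Rinv / d                      # partial correlation matrix
    np.fill_diagonal(P, 0.0)
    np.fill_diagonal(R, 0.0)
    r2, p2 = R**2, P**2
    overall = r2.sum() / (r2.sum() + p2.sum())
    per_var = r2.sum(axis=0) / (r2.sum(axis=0) + p2.sum(axis=0))
    return overall, per_var

# Toy data: four items driven by one common factor (hypothetical loadings)
rng = np.random.default_rng(2)
f = rng.normal(size=(400, 1))
X = f @ np.full((1, 4), 0.8) + 0.6 * rng.normal(size=(400, 4))

chi2, p = bartlett_sphericity(X)
overall, per_var = kmo(X)
print(p < 0.05)     # True: reject "identity matrix", the data is factorable
print(overall)      # overall MSA; rule of thumb wants at least ~0.6
print(per_var)      # drop the lowest variable first if any fall below 0.5
```

The per-variable vector is what drives the one-at-a-time removal rule described above.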
And if you want to know how much variance is shared, you square it. And what does that equal? That's your R squared, right? It's the variance in the dependent variable that's explained by the independent variable, right? Same idea here: we've got correlation in these variables, and that correlation represents shared variance. The loadings that you get out of a factor analysis represent correlations. You square those, and they represent components of variance that are shared with the factor. Same idea as regression, no different, okay? But you have to understand, when we get to differentiating principal components and exploratory factor analysis, that there are different pieces of variance. So there is common variance, and that's the variance that you're saying is common between the variable that you're measuring and the factor that you're computing. There's also unique variance, and that's variance that's specific to the variable you've measured. So think about measuring intelligence, okay? We don't know how to measure it directly. We have lots of variables that we use to get an observation on it. We could do an IQ test, we could do an SAT test, we could look at your grade point average. All of those would be measures of intelligence. And we would typically put those together in a factor analysis, and hopefully they would all be measures of intelligence, okay? And we would come up with some new factor called intelligence, right? That would be the idea. But each of those variables has variance that could be what we call common variance that's shared with intelligence. And in fact, if you've followed the literature on some of these tests as you've gone through college, et cetera, you've probably seen some of this discussion: when you take an SAT test, it's measuring socioeconomic issues, it's measuring all kinds of other things beyond just intelligence. 
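The slope-equals-correlation point is easy to verify in a couple of lines. A quick sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=1000)
y = 2.0 * x + rng.normal(size=1000)

# Standardize both variables (mean 0, standard deviation 1)
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()

slope = (zx @ zy) / (zx @ zx)    # OLS slope through the origin
r = np.corrcoef(x, y)[0, 1]

print(slope, r)                  # identical: the standardized slope IS r
print(r**2)                      # shared variance, i.e. R squared
```

Squaring a loading works exactly the same way: it turns a correlation into a proportion of shared variance.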
So if you're trying to measure the intelligence factor, the construct, there's common variance, which is common to intelligence. There's unique variance that's unique to the SAT test and doesn't relate to intelligence, okay? But it causes the measure to vary, causes the SAT score to vary, independent of how intelligent someone is, okay? Now there's a third piece of variance, which is measurement variation. If I give the same person the same SAT test, and it's raining one day and sunny the next day, they might get different scores. That's just measurement error. Oftentimes you can't really differentiate between unique variance and measurement error, so those are typically lumped together, okay? But it's important to keep in mind these different components of variance, because this is one of the key differences between principal components analysis and exploratory factor analysis: what is the variance that you're really measuring, okay? And you also need to keep in mind the objective of the analysis. So principal components and exploratory factor analysis are both based on measuring some part of the variance. It's just whether it's the common variance or all the variance. And that's really one of the key differences. Yes? Well, hold that question. Once we look at the models behind each of these techniques, it'll become very clear exactly what they're measuring. You'll see it, okay? But I just wanted to set the stage here by saying, when you think about the variables that you're measuring, and you're trying to set up in the background some sort of factor analysis, there is this subsetting of the variance within each of those measured variables, okay? Okay, I think we might have missed something here. Let me just, no, I guess not, okay. All right, so when we get to principal components: principal components assumes that all the variation is common, okay? 
So it's assuming that the unique variance, the measurement variance, and the common variance all get lumped together, and it's calling all of that common variance. So all of it is considered variance that the variable shares with the factor, okay? And in that case, and I'm just making this point, the diagonal of the correlation matrix in the analysis that you run in principal components is taken to be one, because that is the variance of each variable if you standardize the variables, right? So you're putting all the variance, a variance of one, into the analysis. Just keep that thought floating in the back, because it's not important that you understand theoretically what's going on, but just that there is a difference somewhere, and this is one of those places. So the assumption in principal components is that we can represent all the variation in our measured variables through the factors, okay? And typically then the objective for principal components analysis is going to be: find the minimum number of factors that capture the maximum amount of variation. So the objective in a principal components analysis is variance reallocation, if you will. It's moving the variance onto factors, trying to get the smallest number of factors that account for the largest amount of variance, okay? Now when we get to exploratory factor analysis, yes, go ahead, hold that question. My best straight man, okay? You're raising great questions. As soon as I show you the model, it's going to be really easy for you to see how this happens, okay? So I'm just whetting your appetite. So exploratory factor analysis, if we look at that, is also known as common factor analysis, because it only looks at the common variation, okay? Or principal factor analysis is another term that's often used, and this leads to the confusion. So when you see principal factor analysis versus principal components analysis, it's like, what's the difference? 
And this is why a lot of people don't fully appreciate that there are two different models going on in the background here. So with this one, we're assuming that the factors explain only the common or shared variance. It's assuming that you have unique variance that is dealt with separately. Principal components analysis lumps it all together. It's all the same; we don't care to differentiate. We're going to just move as much of it as we can onto as small a number of factors as we can. Exploratory factor analysis says: we understand there are different components to the variance, and we only want to align the common variance with our factors. The unique variance and measurement error we're going to separate out and deal with separately. Our factors are only going to represent that common variation. Now, come back to a regression. Simple linear regression, what do you have at the end? I have y equals beta naught plus beta one x plus what? Error, right, epsilon. So what would that be more like? Principal components, or would it be more like exploratory factor analysis? Right, because what's it doing? It's separating out the unique variance, right? Because that epsilon represents unique variance that's not common to x and y. In that case, we typically call it measurement error, but if you remember your textbook, it also accounted for any x's that you didn't include in the model, right? So in some sense, there's error there that isn't part of what's in the x. Exploratory factor analysis is saying kind of the same thing. We're treating the factor as your y variable, in some sense, and we're saying, well, we have all these measured variables that have variance in common with the factor, but there's also some other variance which we aren't trying to say is part of the factor. Principal components, no: it's saying all the variation is part of the factor, okay? Measurement error, variable error, whatever, okay? 
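The two models behind this distinction can be written down side by side. In standard notation (a sketch, not taken from the workshop slides):

```latex
% Principal components: each component is an exact linear
% combination of the observed variables -- there is no error term,
% so all of each variable's variance is carried by the components.
C_j = w_{j1} X_1 + w_{j2} X_2 + \dots + w_{jp} X_p

% Common (exploratory) factor model: each observed variable is
% explained by the common factors plus a unique term e_i, which
% absorbs specific variance and measurement error.
X_i = \lambda_{i1} F_1 + \lambda_{i2} F_2 + \dots + \lambda_{ik} F_k + e_i
```

Note the direction flips: in PCA the components are built from the variables, while in the factor model the variables are explained by the factors plus error, exactly like the regression analogy above.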
All right, so keep that in mind. Now, the difference in exploratory factor analysis: remember I said that in the correlation matrix, we put ones on the diagonal. In exploratory factor analysis, we do not. We put a measure of the communality. Communality is the common variance. So if you have standardized variables, the variance of each measured variable is one, right? If not all of that is being shared with the factor, then some value less than one is the common variance, right? Anybody follow? So if you put the communalities on the diagonal, the diagonal entries are gonna be all less than one. So the matrix that you're using to do the analysis in an exploratory factor analysis is not exactly a correlation matrix. Once you put those communalities on there, it's more like working with a covariance matrix, okay? Pretty tricky. Again, a source of confusion. Now, why do you need to know this? Well, if you run SAS and you wanna run exploratory factor analysis, you need to tell it to put this on the diagonal. If you don't, you get a principal components analysis by default. In SPSS, if you just run it without choosing an extraction method, you're also gonna get principal components analysis. That is the default. But you don't have to worry about what's on the diagonal, because SPSS is just a little bit nicer: it allows you to just choose an extraction method that takes care of all that for you. SAS is just a little uglier, okay? All right, but it's at least something you have to think about. Now, what's the objective, the intent, here? Typically with exploratory factor analysis, your intent is that you're looking for some sort of latent constructs. Now, it's not always easy to define what's the difference between a latent construct and a factor that would be a principal components factor, okay? Let me give you an extreme example. The consumer price index is a principal components analysis. Right? 
Why? What are they trying to do? They're looking at price changes, variation, across a whole market basket of goods, right? And they're combining all that to give you one index that represents how much variability there is in all those goods. So what did it do? It took all the variation from all those goods, pushed it all onto one factor, calls it an index, becomes the CPI. Now, what have we noticed recently if you follow any economic data? When they report the CPI, they don't just report one index anymore. They report several. So there's like a food CPI and a total CPI and a fuel CPI. Why? Because it doesn't all come out on one factor anymore, right? So you have some variables, like the change in fuel prices and heating oil and stuff like that, that load on a different factor. So it's a different index. Right? Make sense? So you would use principal components for that because you're trying to push all that variation in the prices onto as few factors as possible. You're not trying to understand the essence of a shopping basket, okay? Now the work that you guys do is more oriented toward constructs. You're looking at concepts that can't be directly measured, but that we feel theoretically are out there. Intelligence, worry, hope, great concepts. How do you measure them? Who knows? But you're not looking for an index typically. Now you could be, and if you are, then you're gonna change your mind and say principal components should be the way I go, because I'm trying to create a hope index, for instance. For those of you who have used the depression scale, which I know is pretty common, there are people that use that as an index. You'd analyze it differently. You should analyze it differently, okay? Not everybody does; that's one of the problems. I mean, you guys shouldn't feel that your misconceptions about either of these are unique to this room.
You can find it in the literature all over the place, okay? So, objective here, a little bit different. Oftentimes, with exploratory factor analysis, you're also looking for the input to a confirmatory factor analysis. So ultimately, you wanna go down that path. You're typically not gonna be using principal components to get there, okay? And in fact, principal components may lead you down a false trail that may not work out, okay? All right, now, last piece of commonality here is that oftentimes, if you have lots of variables, you'll get very similar results out of principal components analysis and exploratory factor analysis. There are cases where your data set is set up, the data just comes in in a way, such that if you ran it either way, you'd get pretty much mirrored results. And you'd say, well, why would I do it one way or the other? And that's one of the reasons why people get confused, because I've run a certain data set, and maybe I didn't know what I was doing, maybe I did, I ran it both ways, and I still got kind of the same interpretation. That often happens. But it's because of the data set you're using, not because the methodologies are the same, okay? So you have to keep that in mind. And confirmatory factor analysis, I'll just mention this now, is not an exploratory technique. So remember, what we're doing in PCA and EFA is exploratory. We're trying to get a better understanding of how the data hangs together. If we really believe that we've got something there that means something, so if we really think we're coming up with a measure for intelligence, we need to shift over to confirmatory factor analysis and do a different type of analysis where we can actually do some statistical tests that say this is a valid measure, okay?
So I'm just putting that in as a touchstone because we're not gonna talk much about it today, but that's kind of the next step after you've done either one, typically an exploratory factor analysis, okay? All right, so let's look at where they're different. So now we're gonna look at the actual models, okay? So let's look at the model for principal components. So the idea behind principal components is you're trying to find a linear combination of the Xs, the measured variables, the ones we said you'd get survey measures on or whatever. So we wanna find a set of coefficients. Oh, I do wanna make one comment just because I run into this too much. This is aimed at you, but others as well. If you are thinking that you're gonna do a factor analysis at the end of the day, okay, and you're doing a survey, the more levels you can include on your survey, the more powerful your scale is going to be at the end of the day. So if you look at the literature for market research, which oftentimes surveys many of the same characteristics you do, they've all moved to a 10 point scale. They don't try and define every point on the scale. They may still only label five, you know, measured points, but A, it's an even scale. There's an even number of points, so somebody can't just go down the middle all the time. Okay, they have to kind of register on one side or the other, but also you get much more variation. Remember, we're talking about variation here. If you only have four points on the scale, people may be willing to differentiate their views, but you haven't given them an opportunity to do that. And so it all gets lumped together. You've lost variance. If you think about taking a variable like temperature and saying I'm gonna make temperature into five buckets, well, okay, I've taken zero to 212 and I've got five buckets. That's what, 40-some degrees in each bucket. So I tell you you're in bucket one.
You don't know what the temperature is. Same thing with Likert scales. If you use a five point Likert scale, you're saying, well, I could have used a 10, but I've really collapsed them. And so I really don't know whether you're a one, two, three, four, five, six, seven, eight, nine, or 10. I only know whether you're one through five. Now you don't know which ones got collapsed. So I've lost a lot of information. Now there's always a trade-off with, you know, beating people up and making them answer tough scales, but you should start thinking about making sure that you have enough points in your scales. So that's really at odds with a lot of the movement in social science right now. And that's because it's hard for people to understand a construct in a way that they can differentiate those degrees. So the other argument, and nobody would argue with the statistical argument, because we want variance, is that you get less meaningful data because we can't really discriminate whether we're an eight or a seven. When we're trying to figure out how hopeful we are, how much we trust something, something that's a very sort of mushy, gut-level, as much emotion as it is cognition, concept. It may not be reliable to discriminate those things. Well, typically you aren't asking your respondents to say how hopeful are you. You're asking them a whole series of smaller questions, which is usually easier for a respondent to make a judgment on. So all I'm suggesting is think about it, whether you go to a larger scale; it seems to be we're standardizing our scales at a smaller number of points. And that's fine, because that is measurement variation, versus if you lump it all together, you don't know what's the signal and what's the error. So it's harder to differentiate. Melissa, I just want to emphasize this point about trying to be able to label. What's a two? You might be a three.
You might be a four to another person if there are no labels. But really we're the same on the construct. It's just that you interpret a two as what I think of as a three. And so I've encountered this a lot, a whole lot, this tension between the statistician saying more points on the scale, more points on the scale, and me saying, but make sure they're meaningful points on the scale. So I just really want to put a plug in for you to balance those two things. No, and I would certainly agree with that. But I think there's a lot of research as well in consumer research, where you're asking people for similar thoughts about products, et cetera, where the scales are not any better defined than a five point Likert scale. So there are points that don't have specific definitions associated with them, and they're getting much better results. So I'm just saying there's research out there; you might want to take a look at it and see what some of the theory is or what some of the experience has been, and see how it might work into what you're doing. That's, I mean, that's a good question. I mean, at a pilot stage that might be something you might look at. The other thing is you can always collapse it. You know, if you start out with 10, you can always collapse back to five. So it's not like you have to, you know, you're right. I mean, you can bring it back to the way it has been used. I would also suggest that even if you have a scale that's been verified, unless you're using it in almost the same situation, you ought to be running confirmatory analysis anyway to make sure that it still holds, right? Because otherwise, if you're applying someone else's scale that they've verified in a very unique situation, you can't necessarily assume that that scale is going to hold in an experimental situation it wasn't tested in. So, and we have some experience here in the room with exactly that happening. So, yeah.
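The "lost variance" point in this exchange can be illustrated with a quick made-up simulation: two continuous measures correlate at 0.7, but after each one is collapsed to a 5-point scale, the observed Pearson correlation is noticeably attenuated. The numbers and cut points here are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50_000

# Two continuous variables with a true correlation of 0.7.
x = rng.standard_normal(n)
y = 0.7 * x + np.sqrt(1 - 0.7 ** 2) * rng.standard_normal(n)
r_continuous = np.corrcoef(x, y)[0, 1]

# Collapse each one into 5 ordered buckets (a 5-point "Likert" scale).
def to_five_point(v):
    cuts = np.quantile(v, [0.2, 0.4, 0.6, 0.8])
    return np.digitize(v, cuts) + 1      # categories 1..5

r_collapsed = np.corrcoef(to_five_point(x), to_five_point(y))[0, 1]

print(round(r_continuous, 2))   # about 0.70
print(round(r_collapsed, 2))    # smaller: collapsing threw variance away
```

This is also why, as comes up a bit later, reviewers increasingly ask for correlation methods designed for ordinal data when scales have five or fewer points.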
All right, so that was my own plug, because typically factor analysis assumes that you're dealing with quantitative variables. So with a Likert scale, I think the common assumption now is that if you have a Likert scale with five or fewer points on the scale, it's considered discrete. It's not considered continuous. And we've been getting pushback from reviewers. Did we get a review? Did that come back on our paper? I can't remember. But I know it did on one of ours, where the reviewer came back and said, you can't run this using standard factor analysis. You need to use a different type of software that actually measures the correlations in a different way. It does not assume that you can compute them the same way you would with continuous data. So this is another thing to think about: as you use these smaller scales, if you're running exploratory factor analysis through SPSS and SAS, reviewers are getting more savvy, and they're gonna reject the paper and say you didn't do this right. So that's not just a statistician's view. In fact, actually we found some significant differences too, if you run a five-point... Can I ask a question? Yes, yes. And you do get different results. Yeah, it's ugly. There's other software that's more effective if you wanna do that, but okay. So we're getting a little bit off track, but good discussion. All right, so anyway, for principal components, notice this is a setup that has all the X's on the right; your factor is on the left, right? There's no error term involved here. So this does not look like a regression equation. And the idea is that you're estimating two things here. This is kind of interesting. Both exploratory factor analysis and principal components have a flavor of this. So in a typical regression, you have your X's and your Y's, right? Here you've got equations that you're estimating the coefficients for, but you only have X's.
So not only are you estimating the Y's, you're also estimating the slopes at the same time. So this is pretty slick. So if you didn't fully understand the slickness behind factor analysis, this is pretty neat stuff. You're essentially running regressions here where you don't have a dependent variable. Now, you'll also notice something else in this model. This would look just like a regression model, except for what? There's no error term. Because we said with principal components, all the variance in the X's is treated as shared. There's no unique variance, right? So there's no error term. So this gets measured exactly, okay? Now, the way the algorithm for principal components works, the reason you can do all this, is that you have to put other constraints on the way that you do these estimations. If you just wanted to do this like a regression, you couldn't solve it. So you put some other conditions on. Those conditions have to do with the variation. So the first one is, for the first component you estimate, you pick the direction, which means the A's, so that you maximize the amount of variance in the X's that you share with that Y1, okay? So the first component captures the maximum amount of variation in the X's that it can share, okay? Then you go to the next component or factor, and you choose the direction for that one to capture as much of the remaining variability as possible. So I've captured a certain amount of the variability of the X's in my first factor. So I have some left. I pick my next factor to be perpendicular to the first one, and capture as much of the remaining variance as possible. So again, I don't get it all, but I get as much as I can. I choose the A's so that I do that while at the same time giving me a direction in space, kind of in the XY coordinate system, that's perpendicular to the first factor, okay? So these are my orthogonal factors.
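The mechanics just described come down to an eigendecomposition of the correlation matrix. Here's a sketch in numpy with made-up data; this shows the idea, not any particular package's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data: 200 observations of 4 correlated variables, standardized.
X = rng.standard_normal((200, 4)) @ np.array([
    [1.0, 0.8, 0.0, 0.2],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 0.0, 1.0],
])
X = (X - X.mean(axis=0)) / X.std(axis=0)

R = np.corrcoef(X, rowvar=False)

# The coefficient vectors (the A's) are eigenvectors of R; the
# eigenvalues are the variances each component captures.
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]          # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = X @ eigvecs                        # the components (the Y's)
print(eigvals)                              # decreasing variances
print(np.round(eigvecs.T @ eigvecs, 10))    # identity: directions orthogonal
```

The eigenvalues sum to 4 here: all the variance of the four standardized variables gets transferred to the components, which is the next point.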
Then I keep going, and I can keep doing this, and I do the third factor, the fourth factor, and in fact, I can get exactly P factors. So if I start out with P measured variables, I can generate P factors and account for all the variance. So if I have five variables that I measure, I can get five factors. Those five factors will account for exactly the same variance as I had in the first five variables. Now the trick is, if I keep all five of those, I haven't done anything, right? I've gone from five variables to five variables, so ideally I want to get fewer. So I gotta choose which ones to keep, okay? But that's the idea behind the model here. So all the variance in the X's is transferred to the factors. Now, interesting problem here: if you started out with independent variables, so if your X's were independent, what would happen? If you ran the factor analysis, the first factor would be equal to the variable that had the highest variation. The next one would be equal to the variable that had the second highest variation. So you'd just be equating your factors to the variables, and this is why you don't do factor analysis if you've got independent variables, because you don't learn anything. Nothing aggregates, okay? So, all right. Now, there's one possible issue you may run into, and that's if you have very highly correlated variables in your data set; that may impact the way this runs or the results. So you also want to be aware of that. There are different things you can do. You might drop one, because if they're very highly correlated, then they're probably measuring the same thing. Alternatively, you can average them, whatever. You're capturing all the common information. So there are easy ways around that, okay? So, once we run this and we understand the model, now we have to analyze the factors. This is where we look at how many factors do we keep. Yes. You want me to go back? Yes. That's right.
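The independent-variables point just made can be seen in a tiny sketch: with a diagonal covariance matrix (the variances are made up), each "component" simply reproduces one of the original variables:

```python
import numpy as np

# Three independent variables with variances 4, 2 and 1:
# the covariance matrix is diagonal, all correlations are zero.
S = np.diag([4.0, 2.0, 1.0])

eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(eigvals)           # the original variances, largest first
print(np.abs(eigvecs))   # identity matrix: each component IS one variable
```

Nothing aggregates: the first component is just the highest-variance variable, the second is the next one, and so on.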
You don't do any kind of factor analysis. Because if they aren't correlated, there's not gonna be any shared variance. So there won't be any common variance. So you'll still, you'll be like running regressions, essentially. Could be? Well, yeah, well, first of all, if you're doing something like hope, you wouldn't be using principal components, okay? So, I mean, this model wouldn't be the one we would be applying anyway. And so this algorithm wouldn't be used either. Well, that's a very good reason why you wouldn't use this. Because when you're trying to do hope, your goal is not to move the variance. It's not to shift as much variance as you can onto a smaller number of factors. You'd be more oriented toward an exploratory factor analysis, where you're trying to define a construct which is only sharing common variance with some of the X's. Now, we're gonna have the same issue there. So I'm not saying that you're not gonna run into that problem. So let me deal with the question you asked. In all of these algorithms that you're running in factor analysis, there's a whole bunch of them, and I'm just giving you the model and telling you what happens when you run the base kind of default. But we get an option to rotate, right? The rotations can either be orthogonal or non-orthogonal. And what those rotations do is shift the variance around. They capture the same level of variance, and they shift it around based on how you define whether you want it orthogonal, non-orthogonal, whatever. So yes, there's a very good argument why you wouldn't want orthogonal factors in many analyses that you run. From the point of view of developing a scale, this is purely exploratory, to give you a hypothesis as to what you might push into a confirmatory factor analysis. Because one of the other things you're getting, whether you run exploratory or principal components, is that every measured variable has a weight on every factor.
Even though your interpretation only looks at the high weights, right? All the other variables still have weights on that factor. When you wanna shift over to a scale, you wanna get rid of all those small ones. So when you go to a confirmatory factor analysis, you're only loading the high ones on the factor. So now you've got a completely different problem than what you're getting out of exploratory factor analysis. This is purely descriptive. You can't be betting the farm on a scale based on the results you get out of exploratory factor analysis. It's not what it's built for. It just gives you an idea as to which variables might aggregate on factors that you might be able to interpret in a way that makes sense, in terms of ultimately moving to a confirmatory factor analysis. So it's a step in the process. It's not a definitive, yeah, I need this result and it has to be exact, and it can't be orthogonal or whatever. There's no way to test what's right here. You can run every different rotation you want here, whether it be orthogonal or non-orthogonal, with the idea of, am I able to interpret the factors I get out of this better? And if you can, then you're gonna say, yeah, I like this one better. It doesn't prove that it's the right one. There's no proof in this analysis that whatever you come out with is the state of nature. It's what you can build an argument for, okay? So all these are tools. They're not answers. You still have to create the answer out of the output that you get. So it's not pushing a button and taking a result. It's pushing lots of buttons, looking at lots of results, and trying to say, what makes sense? And we'll look at an example. So this will become clear maybe in a few minutes, okay? All right, so we have to figure out how many of these factors do we keep? We said if we run all the factors, we have P variables, we get P factors, we've captured all the variance, but we've still got the same number of variables.
Well, I'd like a smaller number, please. So I need to figure out which ones to keep. Three standard ways that people look at this. Either you look at the variance of the factors that you get, and you pick any factor that has a variance greater than one. Why one, anybody know? Right. So you want the factor to at least have more variance than is in any one of the measured variables that you're using. Remember, they're standardized. So the variance is one. So factors with variance less than one mean I might as well just use the measured variable, because I got more variance in that than I do in the factor, okay? That's a very common one. Oftentimes people look at the percentage of variance explained by the factors and say, well, I want to make sure that I have at least some valid percentage of the variance captured. So I might keep a factor that has a variance less than one because of that. That's a criterion I could use. There is something called a scree test, which I don't pay a lot of attention to. It's a plot that you can look at that oftentimes people use, and I think it's oftentimes misleading. Typically I would argue that the variance of the factors is probably the one that most people would look to first. So first idea: how many factors do I keep? Once I decide how many factors I keep, now I can start thinking about how do I rotate them? Yep. What's a good enough percentage that's worth retaining? Well, this again depends a little bit on the area of application. So, you know, in social science research, probably if you're capturing something like 60%, you can make an argument for it. If you start to capture less than 50%, then you have to somehow work around the argument that there's more variance that you're not capturing in the factors than you are capturing, and so the question is, what are you throwing away? But there's no, I mean, it's all rule of thumb stuff.
So the higher it is, the more confident you can be that the variables that you're using are measuring something relevant in the factors that you're keeping. If there's a lot of variance that you're throwing away in the factors that you're throwing away, then you have to at least look at them and say, do I not care about what's represented by those factors and they aren't really relevant for what I'm doing. So is that looking at total variance across all the factors or per factor? Well, I mean, the total variance across all the factors is gonna be 100%. So if you look at most outputs that you get from principal components analysis, you'll see that the factors get defined and are lined up in terms of the first factor and they'll show you how much variance and what percentage of the total that is. So it's the percentage, if you have 10 variables, the total variance is 10 because each variable has variance one. So if the first factor explains three variance units, then that's roughly 30% of the variance. If the next factor explains two variance units, then in those two factors you've got 50% of the variance. If you feel like that's not enough, then you look at the next one. If the next one has at least one unit of variance in it, then it still meets the variance of the factor being greater than one. Now you're up to 60%. The next one's gonna be less. So if you were hoping to get 70% of the variance, you're gonna have to start dipping in the factors that have variances less than one to start with. Does that help? All right, that makes sense. Okay, all right. When you get your initial results, when you say I only wanna look at three factors, oftentimes those factors, you can't make any sense out of them because maybe every variable loads on the first factor and every variable loads on the second factor and you've got all kinds of cross-loadings between factors and you just say, well, I can't interpret this. So we start rotating. 
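Before the rotation piece, the variance-unit arithmetic just walked through can be sketched out. The factor variances (eigenvalues) here are the hypothetical ones from the example:

```python
import numpy as np

# Ten standardized variables: total variance is 10, one unit each.
total_variance = 10.0

# Hypothetical factor variances (eigenvalues), largest first.
factor_variances = np.array([3.0, 2.0, 1.0, 0.8])

cumulative_pct = np.cumsum(factor_variances) / total_variance * 100
print(cumulative_pct)   # 30%, then 50%, then 60%, then dipping below one unit

# Kaiser-style rule: keep factors with at least as much variance
# as any single standardized variable (i.e., at least 1).
keep = factor_variances >= 1.0
print(keep.sum())       # three factors kept under that rule
```

Getting to 70% in this example would mean keeping the fourth factor, whose variance is below one, which is exactly the trade-off described above.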
We rotate because once you've said I want three factors, you've set the communality. So you've set, here's the amount of variance of all these variables that I'm capturing in these, let's say, three factors, which is less than the total now. So all these rotations keep the communality values the same. In other words, the amount of variance explained in each of the variables stays fixed. But there is more than one solution that will give you those communalities. So all the rotations are doing is giving you different solutions where either the factors aren't orthogonal, or they are derived in a different way other than that algorithm that I showed you that says pick the first one, put the maximum amount of variance on that one, make the next one orthogonal to it, and put the maximum of what's left on that one. So every different rotation is just a variation on that algorithm. So for instance, varimax says: maximize the variance of the squared loadings within each factor, so that each variable tends to appear on only one factor. That's why people tend to use varimax rotation, because each variable typically only loads highly on one factor. So it makes that factor a little bit easier to understand and explain. It's not the only one. There's quartimax. There's, I don't know, there's about six or eight of them. Some are orthogonal, some are non-orthogonal. If you use an orthogonal rotation, the factors stay orthogonal. If you use a non-orthogonal rotation, then you give up on that, okay? And so, built in, if you believe that the factors you're dealing with should be correlated, you might be better off with a non-orthogonal rotation. But you'd be doing that in exploratory factor analysis, not principal components, okay? For most of the work you guys do, okay?
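Varimax itself is a small orthogonal-rotation algorithm. Here's a common SVD-based sketch in numpy (one of several equivalent formulations; the loadings matrix is made up). The thing to notice is the property claimed above: the rotation moves loadings around, but each variable's communality is exactly preserved:

```python
import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """Orthogonally rotate a loadings matrix to (approximately)
    maximize the variance of the squared loadings per factor."""
    p, k = loadings.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        u, s, vt = np.linalg.svd(
            loadings.T @ (L**3 - (gamma / p) * L @ np.diag((L**2).sum(axis=0))))
        R = u @ vt                    # best orthogonal update
        d_new = s.sum()
        if d_new < d * (1 + tol):     # converged
            break
        d = d_new
    return loadings @ R

# Made-up unrotated loadings: 4 variables, 2 factors, messy cross-loadings.
L0 = np.array([[0.7, 0.4],
               [0.8, 0.3],
               [0.3, 0.7],
               [0.2, 0.8]])
L1 = varimax(L0)

# Each variable's communality (sum of squared loadings across factors)
# is unchanged, because the rotation matrix R is orthogonal.
print(np.round((L0**2).sum(axis=1), 4))
print(np.round((L1**2).sum(axis=1), 4))
```

Other rotations (quartimax, the oblique ones) differ only in the criterion being optimized, not in this basic structure.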
All right, so if you really want to explore those, most of the documentation for whatever software you're using has an explanation of those algorithms and what the intent is, okay? Or you can probably Google it and find it as well. Okay, so we're still doing analysis of factors. The next step in this is really the interpretation. So we have picked how many factors we keep in a principal components. We had to say, how do we rotate them? We rotate them because we want to interpret them, okay? That's the whole goal: you want a smaller number of interpretable factors from the data set that you started with, okay? So typically we look at the factor loadings, now that we've rotated, okay? So now we look at what variables have high loadings on which factor, and we tend to define that factor in terms of those variables. So that gives us some way of trying to build a story for each of the factors. And that's all it is, is a story. It's your story: what you think that factor represents, okay? Plus or minus 0.5; some of that's dependent upon how big your sample size is. With smaller sample sizes you'd want the cutoff to be higher; you'll find a range in the literature. But the loadings that you're looking at, I'm sorry, did you have a question? Oh, I didn't even answer your question. Okay, so the loadings that you're looking at are just the correlations between the variables that you've measured, your observed variables, and the factor. So if you're cutting off at 0.5, you're saying, well, I just need a correlation of 0.5 or higher, and I'm gonna include that variable in my interpretation of that factor. But if it gets lower than that, then I'm gonna say, well, it's not really a high enough correlation; I'm gonna assume that it doesn't really have a meaning here. That's interpretation.
It's rules of thumb; there's nothing statistically significant about that number. There is a suggestion that 0.7 or greater, plus or minus, is maybe a better measure to use, assuming your sample size is high enough, because being a correlation, if you square it, it's shared variance. So if you square 0.7, you get 0.49, which is roughly 0.5. So you're saying, I'm roughly sharing half the variance of this variable with my factor. Okay, so that means that that measured variable really means something about that factor. Again, it's a rule of thumb; it's not a hard and fast rule. So the communalities that you pick up are just the squared loadings: add them up across the factors, and they give you the shared variance between an observed variable and the factors you've kept, and I'll show you an example of that, okay? Now, one of the things you may decide to do is that if you have a variable that has low communality, meaning it's not sharing a lot of variance with any of the factors, it may not be worth keeping it in the analysis. So again, this might be a point at which you say, I'm gonna take that variable out, rerun my analysis, see if I get a clearer picture, because it's still taking up space. The same way in regression: once you've decided a variable's not significant and you take it out, you rerun the regression and you get different estimates for all the other variables. If you leave it in, it's taking up oxygen from the other variables. So you may not be getting a true read of the coefficients, okay? Same deal here. Yes? Ooh, I didn't say it'll automatically be clear. The tendency with varimax is to try and only load each variable highly on one factor, okay? Doesn't always work. Yeah, well, if they're all high, then you've got other problems, okay? Okay. All right, so now let's look at the difference in exploratory factor analysis.
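As a quick aside before that: the 0.7 arithmetic, and the communality computation it feeds into, as a tiny sketch. The loadings row is hypothetical:

```python
# A loading is a correlation; squaring it gives shared variance.
loading = 0.7
print(round(loading ** 2, 2))   # 0.49: roughly half the variable's variance

# A variable's communality = sum of its squared loadings across the
# factors you kept. Hypothetical loadings for one variable on 3 factors:
row = [0.72, 0.31, 0.10]
communality = sum(w ** 2 for w in row)
print(communality)   # mostly driven by the one high loading; a low total
                     # would suggest dropping the variable and rerunning
```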
So here, remember we said that the variance of the variables can be decomposed. So exploratory factor analysis tries to account for that: it deals with the common variance separately from the unobserved or the error variance, okay? And so it's only really trying to model the communality. And here's the model. Now, notice how different this model looks from the last model. First of all, the thing that you might notice: in the principal components model, the factor was on the left and the observed variables were on the right, okay? This is a simple thing to point to. In this one, the observed variables are on the left and the factors are on the right, okay? So here we're saying we're modeling each of the measured variables that we observed. So the x's are what we observe, and we're saying we can come up with some smaller number of factors, which we don't know, right? We haven't measured them, but through this analysis we can get values for them in a way that they will add up to all the individual variables that we already have. Notice that there's an error term on the right. This looks more like a regression equation, right? There's an epsilon which accounts for all the unique variance in the x. The same way in a regression, the epsilon accounted for the unique variance in the y, right? So does everybody see the difference in these models? Completely different. So when you're using them, you're getting something very different depending on which one you use. Principal components versus exploratory factor analysis, you get different results because you're measuring something different, okay? Now, what you get out of exploratory factor analysis is that you're actually reconstructing all the covariances between the x's, so that's an actual outcome, and that's how the coefficients are determined here. So under these assumptions, and again, these should look somewhat like regression-type assumptions.
So the x's and the epsilons are standardized, okay; that just takes out an intercept term. We're assuming that the epsilons have mean zero and are uncorrelated with any of the x's, right? Standard assumption in regression. We're also assuming that the epsilons are uncorrelated between x's. Standard assumption, right, in regression. So I mean, we're not pulling in anything that you haven't seen before, but it's a very different model here, okay? So you're finding the same set of zetas, the factors; the same set of factors is working in all these regressions. So it's, again, a really interesting thing to think about: when we do regression, we have x's and y's, while here we effectively only have the y's, the observed variables. And somehow we're estimating the predictors and the coefficients. So that's kind of cool. So all the correlation that you get between the x's is what is driving the estimation of the parameters here, which is not what drove the principal components analysis. That's all variance-driven, not correlation-driven, okay? So the objective here: find a small number of common factors that explain the correlation between the original variables. Not the same as what we had as the objective for principal components, okay? So the way we get that is, instead of putting ones on the diagonal of the correlation matrix, we take the ones out and we put in estimates of the communalities. And as the algorithm goes through and estimates these parameters, it keeps replacing the diagonal in that matrix with better estimates of the communality as it refines the estimates of the factors, okay? And in SPSS, this is known as principal-axis factoring. So when you get to that table, and I'll show it to you in a minute, when you pull down and say, what's my extraction method? You have to go to the drop-down list and you have to find principal-axis factoring if you want to run exploratory factor analysis. If you just run it as is, you're getting a principal components analysis.
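That iterative diagonal-replacement idea (principal-axis factoring) can be sketched in a few lines of numpy. This is a simplified sketch of the algorithm, not SPSS's or SAS's exact routine; the correlation matrix is built from a made-up one-factor model so we can see the loadings get recovered:

```python
import numpy as np

def principal_axis(R, n_factors, max_iter=200, tol=1e-8):
    # Start the communalities at the squared multiple correlations
    # (the PRIORS=SMC idea): 1 - 1/diag(R^-1).
    h2 = 1 - 1 / np.diag(np.linalg.inv(R))
    for _ in range(max_iter):
        Rh = R.copy()
        np.fill_diagonal(Rh, h2)              # communalities on the diagonal
        eigvals, eigvecs = np.linalg.eigh(Rh)
        top = np.argsort(eigvals)[::-1][:n_factors]
        loadings = eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0, None))
        h2_new = (loadings ** 2).sum(axis=1)  # refined communality estimates
        if np.max(np.abs(h2_new - h2)) < tol:
            return loadings, h2_new
        h2 = h2_new
    return loadings, h2

# Made-up single-factor population: loadings 0.8, 0.7, 0.6, 0.5,
# with the rest of each variable's variance being unique.
lam = np.array([0.8, 0.7, 0.6, 0.5])
R = np.outer(lam, lam)
np.fill_diagonal(R, 1.0)

L, h2 = principal_axis(R, n_factors=1)
print(np.abs(L.ravel()))   # close to 0.8, 0.7, 0.6, 0.5
print(h2)                  # converged communalities, all below one
```

Unlike the principal components run earlier, the diagonal being analyzed here is less than one throughout, so only the common variance is being modeled.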
If you want to do it in SAS, you actually have to put in an option for the priors. Is it SMC, the squared multiple correlation? Yes, thank you, squared multiple correlation. So you have to put PRIORS=SMC on PROC FACTOR to get the exploratory factor analysis results, not the principal components results, okay? Now, it's called exploratory, which could also apply to principal components, because oftentimes we don't really have a hypothetical model in mind. We're trying to see what comes out, and use that as guidance to create a hypothetical model. There are arguments around that, but it's not unusual, okay? Recognize the difference between this and a confirmatory factor analysis, though. When you run exploratory factor analysis, every variable has a loading on every factor, okay? When you run confirmatory factor analysis, you eliminate that: you're hypothesizing that you can measure each factor with only the important variables for that factor, and you don't have to use the rest. And then what you test in a confirmatory factor analysis is whether the statistics bear out that that worked, okay? All right, so for the analysis — and this is, again, a point of confusion — once you've run exploratory factor analysis, the things you do afterward are exactly the same. You still have to pick how many factors to keep, using the same methods. You still have rotations you can use to help you get a clearer definition of your factors — all the same rotations. All you're doing here is trying to come up with an interpretation you can make sense of. So you're looking for: which one can I build a story around best? Hopefully your story has some basis in theory, but...
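To make the algorithm concrete, here's a minimal numpy sketch of principal-axis factoring — the same iterate-on-the-diagonal idea, not SPSS's or SAS's exact implementation, and the function name and defaults are my own:

```python
import numpy as np

def principal_axis_factoring(R, n_factors, n_iter=100):
    """Sketch of principal-axis factoring on a correlation matrix R.

    Starts the diagonal at the squared multiple correlations (the
    SAS PRIORS=SMC choice), then repeatedly re-estimates the
    communalities from the current factor loadings.
    """
    R = np.asarray(R, dtype=float)
    # Initial communalities: h_i^2 = 1 - 1 / (R^{-1})_{ii}
    h2 = 1.0 - 1.0 / np.diag(np.linalg.inv(R))
    for _ in range(n_iter):
        reduced = R.copy()
        np.fill_diagonal(reduced, h2)        # "reduced" correlation matrix
        eigval, eigvec = np.linalg.eigh(reduced)
        top = np.argsort(eigval)[::-1][:n_factors]
        # Loadings come from the leading eigenpairs of the reduced matrix
        loadings = eigvec[:, top] * np.sqrt(np.clip(eigval[top], 0, None))
        h2 = (loadings ** 2).sum(axis=1)     # refined communalities
    return loadings, h2
```

Run it on a correlation matrix that really does have one common factor and it recovers the loadings, with final communalities below one — exactly the behavior described above.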
I was asking before: when you're choosing the number of factors by percent of variance explained, in this case the variance isn't going to be the total variance — it's not going to be, say, 0.6 out of a total of one. Am I correct about that? Correct. And so do we have criteria or rules of thumb? Well, you'll still have a total amount of communal variance, and when you keep a certain number of factors, you'll still be keeping a certain percentage of that communal variance. Okay, so that's what it's measuring in this case? Yes, that's what it's measuring in this case. So it still makes sense as a criterion? Right, okay. So again, the rotations do basically the same thing. You have the same rotations to choose from, and your interpretation is all done the same way. Once you've run whichever one you want, the stuff you do after that is all the same, whether you're using principal components or exploratory factor analysis. Okay, no difference there. So let's look at an example. I'm just going to run through an example using SPSS so you can see how things differ while it's still fresh in your minds. So I'm going to switch out of this presentation, and we're going to go over here to a dataset that is actually prices for bread, burger, milk, oranges and tomatoes in 23 cities. Oh — okay, what did I have to do? All right, somebody guide me here. Click and hold and drag to the right. Left. Yep — the only reason I know is because I did this very recently. Wow, you got that up a lot faster than I would have. Yeah, we did this earlier; we don't want to extend it anymore. Right, duplicate, yep. And then apply.
We had to do that for the — okay. So does that work? Okay, all right. So here's the data: prices for these five items in 23 cities, okay. Now what I'm going to do is run this analysis as a principal components analysis, then run it as an exploratory factor analysis, so you can see how they differ. You probably would never run the same data both ways in practice, but it's useful because it demonstrates the broad difference. All right, and I have some cheat notes. So to run this in SPSS, you go down here to Dimension Reduction, then Factor, and we pick these variables, okay. Now I'm going to go through the various options. We definitely want the correlation coefficients and significance levels. Here's our KMO and Bartlett's test of sphericity. We also want the anti-image, and I'm going to ask for the reproduced correlations because we'll look at those in a while. The anti-image of the correlation matrix is where you get the individual variables' measures of sampling adequacy, okay. So we'll do that. Now let's look at the extraction. Notice the method up here — the default, principal components. If I want to run exploratory factor analysis, here it is down here: principal axis factoring, okay. So I'm going to use the correlation matrix. We'll take a look at the scree plot. Initially, because we're doing principal components and I want to make a point, I'm going to keep all five — there are five variables, so I'm going to get five factors, okay. So we'll continue. Rotation: I'm not going to do any rotations right now; we'll add that in. I'm not going to save the scores, though I could. I am going to sort the loadings by size, which just makes them easier to read. And so let's run this. Okay. All right, so here's our correlation matrix. What's the first thing you check?
Right — you want to check across and make sure that burger is correlated with at least one other variable at 0.3 or higher, and the same for the rest. So oranges is starting to look a little iffy, right? It's just sneaking over 0.3 with tomatoes. The rest of them seem like we're probably okay, although milk is also maybe a little bit low. So right off the bat, you should be identifying: I could have some problems coming up, and those are the variables you want to keep track of. Okay, here's your Kaiser-Meyer-Olkin — yes, I knew Olkin was the last one — the KMO measure of sampling adequacy. Okay, we said it should be at least 0.6. It's 0.66, so it's adequate. Okay, not great. And here's our test of sphericity, and we're going to reject at 0.05, so that at least is doing what we had hoped. All right, here's your anti-image matrix. The diagonal elements of the anti-image correlation matrix are your measures of sampling adequacy for each variable. So we said those should be 0.6 or greater as well, right? And you see they all are. They really need to be above 0.5 — if one of them had been below 0.5, rather than going any further, we should go back and take that variable out. It's not going to help our analysis, and it may screw it up, okay? So you want to do all this up front, to make sure you've got a decent set of data to work with. Okay, here's our communalities: the initial communality that's used in the analysis, and how much was extracted in the factors. Now remember, I used five factors and I had five variables, so the factors are mirroring all the variance. That's why these are all ones — they should be all ones, okay? And here's the variance for your first factor, okay? It's 2.4, the second one is 1.1, the third one's 0.7, the fourth one's 0.4, the last one's 0.2. If you add all these up, what do you think they add up to? Five, yeah, okay?
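The KMO statistic and the per-variable MSAs on the anti-image diagonal are easy to compute directly if you want to see where they come from. A numpy sketch (my own function name; the formula compares raw correlations against partial correlations):

```python
import numpy as np

def kmo(R):
    """Kaiser-Meyer-Olkin sampling adequacy from a correlation matrix.

    Returns (overall KMO, per-variable MSA). The per-variable values
    are what SPSS shows on the diagonal of the anti-image correlation
    matrix.
    """
    R = np.asarray(R, dtype=float)
    S = np.linalg.inv(R)
    d = np.sqrt(np.diag(S))
    P = -S / np.outer(d, d)            # partial correlations
    np.fill_diagonal(P, 0.0)
    R0 = R - np.diag(np.diag(R))       # off-diagonal raw correlations
    r2 = (R0 ** 2).sum(axis=1)
    p2 = (P ** 2).sum(axis=1)
    msa = r2 / (r2 + p2)               # per-variable sampling adequacy
    overall = r2.sum() / (r2.sum() + p2.sum())
    return overall, msa
```

Large partial correlations drag the ratio down, which is why KMO punishes variables whose correlations are mostly explained away by the other variables.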
Because when you standardize these variables and add up all the variances, you get five, right? I've kept all the factors, so they should capture all the variance. So if you add these numbers up, they add up to five, okay? And this shows you the cumulative percentage represented. So the first factor accounts for almost 50% of the variance in all five variables. The first two capture roughly 71%. After that, you've got to make an argument. You could argue: hey, I'd like to keep 85%, even though one of my factors isn't capturing as much of the variance as any single variable did. It's an argument you can make, and it's justifiable, okay? And again, if it helps the interpretation at the end, you may decide, absolutely, I'm going to keep it. If it hurts the interpretation, you're going to decide, absolutely, I'm going to throw it out, okay? Because it's all based on what you can interpret in the output, not on any of these measures at this point. So this is what the scree plot looks like. There's an argument that says a good scree plot has an elbow, right? So if you looked at this, the elbow's about right here. So you take the number of factors up to and including the elbow, and drop the ones after that. This is just a plot of the variance of the factors against the factor number, okay? So it's showing you where the factor variance starts to level off and isn't very big, and these are the bigger ones. All right, that's the logic behind it. If you're dealing with a lot of factors, oftentimes these scree plots are really useless, because they've got multiple elbows and they move around and they're really hard to interpret. Okay, so here's our loadings, okay? This gives you the loadings on factor one — they call it component one because you ran a principal components analysis, so they're trying to help you remember what you're looking at, okay?
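The "adds up to five" bookkeeping is just the trace of the correlation matrix. Here's a quick check on a made-up five-variable correlation matrix (synthetic numbers, not the workshop's price data):

```python
import numpy as np

# Build a valid 5x5 correlation matrix from a known two-factor
# structure (an illustrative stand-in for the five price variables).
F = np.array([[0.8, 0.1],
              [0.8, 0.1],
              [0.7, 0.2],
              [0.2, 0.8],
              [0.3, 0.6]])
R = F @ F.T
np.fill_diagonal(R, 1.0)

eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]   # factor variances, descending
pct = 100 * eigvals / eigvals.sum()              # percent of total variance
cum = np.cumsum(pct)                             # cumulative %, as in SPSS
```

Five standardized variables means the eigenvalues sum to five, and the cumulative column ends at 100% when every factor is kept — dropping factors is the only way the retained percentage falls below that.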
It won't say the same thing when you run exploratory factor analysis. So the correlation between burger and component one is .89 — pretty high. And then .78, almost .8, for tomatoes, and almost .8 for bread. So this looks like bread, tomatoes and burger, okay? You might say, well, how would I interpret that? Well, suppose I told you this data was collected by, let's say, McDonald's, okay? They may be tracking these food prices because they have an impact on their business. So this might make a lot of sense for them, because this is kind of like the lunchtime food index, right? Then you'd say, well, okay, what about oranges and milk? Well, oranges gets a kind of mediocre loading on one, a big loading on two, and not much else loads on factor two, right? So factor two looks like an oranges factor — maybe that's the breakfast index. And then milk — well, you know, that covers a wide range. That's breakfast, lunch, whatever, milkshakes, all that kind of stuff. So that's on the third factor. So it looks like we have three factors. But if you look at it a little bit closer, you see that tomatoes kind of cross-loads on four and one, and that's kind of ugly because I don't know how to deal with that. And it's a cross-loading in opposite directions, so it's not quite as pretty an interpretation as we might have thought, okay? So that's our interpretation. Remember, we didn't do a rotation and we kept all the factors, so all we had left to do was interpret the result. I'll show you one other piece here: these are the reproduced correlations. Yes, question? Yes? Yeah, they'd add up to — yes, okay? Good question, and thank you for asking it; I meant to say it. If you look at the reproduced correlations: notice that we were using the correlation matrix and we've kept all five factors.
So we've kept all the variance, and if you look at the residuals here from the original correlation matrix, there aren't any. That only works because we kept all five factors, okay? So we'll look at the difference that happens when we only keep the ones we want. So we've looked at this — what conclusion might we draw on how many factors we might want to keep? What do you think? How many say two? Notice how I led you in. How many want three? The twos have it, okay? It's just more to think about. So let's look at what happens if we only keep two, all right? That would be equivalent to saying we're just using the eigenvalues-greater-than-one rule here, okay? I can send you the data, and you guys can play with the three-factor solution at home if you want to. So let's go back to what we ran, okay? The only thing we're going to change here is the extraction: we're going to base it on eigenvalues, so if the eigenvalue is greater than one, we'll keep the factor. And I'm also going to rotate the result, okay? I'll just use a varimax rotation, and we'll see what we get, okay? You can probably guess I've looked at this before. Like a good lawyer: never ask a question you don't already know the answer to. Okay, now notice up front here — did any of this change? Why? What does it depend on? It only depends on the measured variables. It doesn't depend at all on what type of factor analysis you're doing, or how many factors you keep, or what rotations you run, or anything. All these things that we looked at up here apply to all the measured variables, so they are not going to change. When you change your extraction and all that kind of stuff, they all stay the same, so we don't need to look at them again. Oh — I'm looking at the wrong one, sorry, got to come down. Okay, thank you. See how it all looks the same? So let's make sure we did this right.
All right, so let's go back and look at the extraction. Yes, thank you — continue, and okay. All right, so all this stuff stayed the same. So now look at the communalities. Why are these communalities less than one now? Because we've thrown away some factors, okay? So this is all the variance in those variables that's captured by the two factors we're keeping. Okay, now this is the same picture that we saw before, right? This was the same result we had for the first two factors when we looked at all five. But now, because of the rotation, we're shifting variance across the factors. We've kept the same total amount of variation in these first two factors, but what the rotation does is shift it to satisfy the algorithm in that rotation. So now you see more variation going into the second factor, and we're losing some out of the first factor, in the hope that we can interpret the factors better. Okay, and notice the cumulative percent: because the communalities stay the same, the cumulative percent stays the same. It's still 70.1, same as we had right here in the first two factors, right? The only thing that's changing is where that variation ends up. The scree plot looks the same. And this is the same component matrix we saw before, but it's just the first two columns of it — the same as the five-factor one, but throwing away the last three, okay? So all these loadings are exactly the same as the ones we saw when we ran all five. Where it changes is in these two. So here's our rotated components. Notice these are now different from the ones above. So we've still got bread, burger, and milk, but we've also got tomatoes, okay?
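The rotation step itself is mechanical. Here's a compact sketch of Kaiser's varimax algorithm in its SVD formulation — a simplified stand-in for what SPSS runs, with my own function name:

```python
import numpy as np

def varimax(loadings, n_iter=100, tol=1e-8):
    """Orthogonally rotate a loading matrix toward the varimax criterion."""
    p, k = loadings.shape
    T = np.eye(k)                      # accumulated rotation matrix
    crit = 0.0
    for _ in range(n_iter):
        B = loadings @ T
        # SVD step of Kaiser's varimax algorithm
        U, s, Vt = np.linalg.svd(
            loadings.T @ (B ** 3 - B * (B ** 2).sum(axis=0) / p))
        T = U @ Vt
        new_crit = s.sum()
        if new_crit - crit < tol:      # criterion stopped improving
            break
        crit = new_crit
    return loadings @ T, T
```

Because T is orthogonal, each variable's communality — the sum of squared loadings across its row — is untouched; only the split of variance between the factors moves, which is exactly the "shifting" described above.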
Now, it might even better represent burgers, because what do you put on a burger but a tomato, right? And then you have oranges over here loading very highly on the second factor, but you've still got this cross-loading on tomatoes. So it's not great. We might try some other rotations and see if we can get rid of the cross-loading, okay? Otherwise we've got to build a story for it. So I've built a story for you: I'm a McDonald's statistician, and I just came up with a way to understand my first factor and a way to understand my second factor. They could — I mean, you'd look at their menu and say, why is this happening? Maybe you'd come up with something. But it's all based on the story; it's based on the interpretation. There's nothing in the statistics here that says any one of these rotations is any better than the others, okay? Now, if I square these loadings down a column and add them up, I get the new variance for the first factor, and if I do the same up here, I get the new variance for the second factor. If I square them and add across a row — across the factors — I get the communality for bread, then the communality for burger, et cetera. So everything that's up higher, you can generate out of this matrix, okay? All right, so now let's see what happens to this solution if I run it as an exploratory factor analysis. So let's come down. I'm going to run exactly the same analysis, but I'm going to change my extraction method: instead of principal components, I'm going to use principal axis factoring, which is my exploratory factor analysis, okay? So let's see how everything changes for this same dataset. But first I want to point out one other output — if I just go back up here, I wanted to look at the reproduced correlations, okay? Remember the ones above?
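That squaring-and-summing bookkeeping is worth seeing once in code. With a hypothetical loading matrix (rows are variables, columns are factors — made-up numbers, not the on-screen table):

```python
import numpy as np

loadings = np.array([[0.85, 0.10],    # bread    (hypothetical values)
                     [0.80, 0.15],    # burger
                     [0.20, 0.75],    # oranges
                     [0.60, 0.55],    # tomatoes
                     [0.30, 0.25]])   # milk

factor_variance = (loadings ** 2).sum(axis=0)  # square down each column
communalities = (loadings ** 2).sum(axis=1)    # square across each row
```

The two totals agree: summing the factor variances or summing the communalities gives the same retained variance.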
When we had all the factors, we could reproduce the correlations exactly, right? If I only keep two of the factors, notice what happens, okay? My residuals start to be sizable for some of these. And we find out that 70% of these residuals have absolute values greater than 0.05, which is an indication that you're not capturing the covariance. When you run principal components, you don't care, because you're not trying to capture the covariance — I just wanted to show you what happens, okay? But it's going to be important to us when we run exploratory factor analysis, because there we are trying to capture the covariance. So we want these numbers to be smaller when we run it through exploratory factor analysis, okay? So now we'll go down — here's the exploratory factor analysis. Notice again, the top outputs — the KMO, the Bartlett's test, the correlations, the MSAs — all stay the same. They aren't dependent on what you're running, just on the data you're using, and we're using the same variables. Communalities — now notice the communality here. The initial communalities: remember, the other one started with ones because it was using the correlation matrix as-is. This takes the correlation matrix and replaces the diagonal with an initial estimate of the communality, so it's not ones now for the initial communality. And this is the final communality, okay? If you notice, these numbers are a lot smaller than the ones we saw right up here. So keep those in mind: 0.7, 0.8, 0.5, 0.8, 0.7, right? The lowest one was milk, about 0.5 — we're thinking that's still roughly enough to keep it. But if we come down here to where we ran this as an exploratory factor analysis, look what happens to milk. Milk is so low — down around 0.18 — you'd probably delete it immediately, because you're not measuring it; it's not sharing anything with your result. For the rest, you've got a couple of 0.6s, and oranges is pretty low too, okay?
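The reproduced-correlation table and its residual footnote are just the loadings times their transpose, compared against R. A sketch (my own helper function; SPSS's footnote reports the share of off-diagonal residuals above 0.05):

```python
import numpy as np

def reproduced_residuals(R, loadings):
    """Reproduced correlations, residuals, and the share of
    off-diagonal residuals with |value| > 0.05 (SPSS's footnote)."""
    R_hat = loadings @ loadings.T
    resid = R - R_hat
    off_diag = resid[~np.eye(len(R), dtype=bool)]
    frac_large = float(np.mean(np.abs(off_diag) > 0.05))
    return R_hat, resid, frac_large
```

On a correlation matrix built from a two-factor structure, keeping both factors reproduces every off-diagonal entry exactly, while keeping only one leaves most residuals sizable — the same before-and-after contrast shown in the output.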
Now, why is that happening? It's happening because exploratory factor analysis only looks at the communal variance — the variance in common — while principal components takes the total variance and shifts it around. This is only trying to find factors associated with the common variance, and the common variance is less: each of these variables has unique variance associated with it, and that's what's not getting included here, okay? That's why these numbers are all less than one. So if we look down here at the factors, you notice we kept the first two, even though in this case the second one doesn't even have an eigenvalue greater than one. The initial extraction was the same as what you ran in principal components, right? That's your starting point, but once you run the principal axis factoring algorithm, this is what you end up with, and the second one doesn't have an eigenvalue greater than one. So you might not keep it — you might use only one factor in this case. Or it might be telling you that you just haven't measured the second factor very well and you need more variables. The cumulative percentage is down to 50% now. Notice the rotated amount is still exactly the same, because the rotation doesn't affect the communalities; it just shifts the variance around. So in the rotated version, the second factor gets a little bit closer to one, because we've stolen some variance out of the first factor. Now, again, if I were doing this, I would run every rotation I could to see if I could get a result that looks better, okay? But typically you find the other rotations are oftentimes even more difficult to interpret — still, it's worth a try. The scree plot stays the same. If we now go over and look at the loadings, you notice they're different as well. And it's called a factor matrix now, by the way, as opposed to the component matrix.
So there are some hints there to help you realize which one you've run, okay? You see that we've still got bread, burger, and tomatoes — these are the ones we had before, I think. But we don't really have anything loading very highly on the second factor. So you could almost argue the second factor here isn't measuring anything. You could make an argument — a pretty weak one — for oranges, because the loading is so low, okay? I mean, rounding it you get to 0.5, which might be good for exploratory research purposes, but it's probably not going to get you a paper, okay? So you can see the drastic difference in the two ways you can run this. And then if we look at the reproduced correlations — remember what happened when we had only the two factors in principal components: 70% of the correlations were not reproduced well, right? Look at it here. When you run exploratory factor analysis, none of the residuals are greater than 0.05. So again: different model, different objective. We've reproduced all the covariances here in exploratory factor analysis; we didn't come close in principal components. Completely different result, okay? And here's your rotated factor matrix, which is the loadings after rotation. All the rotation does for us here is, again, make tomatoes look like it's almost cross-loading, but it gives us a hint of a factor for fruit and vegetables — well, actually fruit, because tomatoes are a fruit, not a vegetable. And then it has a hamburger factor, okay? The reason I ran both, side by side — which you would typically never do, because you'd have in mind when you start the analysis which of these two methods you're going to use — was to give you a picture of how different the results are.
So if you're trying to build a scale, you do not want to be using principal components analysis, because that's probably not what's going to drive a confirmatory factor analysis. You really want to understand the common variance in the latent constructs you're trying to measure. That's why, oftentimes, if you use principal components and expect the results to look a certain way, and then you put it into confirmatory factor analysis based on that, it falls apart and isn't supported. And you think, what the hell did I do? That's the reason: you ran the wrong analysis to generate your hypothesis, so your hypothesis didn't hold when you got to the confirmatory factor analysis, okay? So I'm going to stop here, because I know you're all probably fading into oblivion, and I only had a few more slides tying this to confirmatory factor analysis — I think you've heard enough for now. So, any questions we haven't hit along the way? Anybody? It's all clear now, right? Clear as mud. Yes? Will we get a copy of the slides? I can provide a copy of the slides, absolutely. Okay, yep. I will email those to you, and you can distribute them. Okay, great. Thank you all for being here. Now, if I can get my computer to ever work the same as it did when I walked in here today — there are some other issues if you try to use SAS. SAS has some ugly output for, I think, exploratory factor analysis: you get negative eigenvalues and things that you have to interpret. So I think the SPSS output is a lot easier to get in line with. But if all you have is SAS, read the documentation — that's all I'm going to tell you. It's in there somewhere. Yep, yes. Well, Amos is a component of SPSS now — it always was, I think. When SPSS owned it, you had to buy it separately. You still buy it separately, but it's all through IBM now, and they've actually embedded it in the Analyze drop-down.
Yeah, and you can still use it independently — I'm not sure if you need the base SPSS system to hold the data file. Typically, Amos is oriented more toward confirmatory factor analysis and structural equation modeling. What Amos is very facile for is when you have, let's say, the output of an exploratory factor analysis. So you've collected your data. One of the other things we didn't talk about: if you're going to move to confirmatory factor analysis, there are constraints on the number of variables you need for each factor. If in your initial data collection you said, well, I've got this factor I think is there, and I'm just going to measure it with two variables, and then I want to create a scale, you may be shooting yourself in the foot. You may get to confirmatory factor analysis and hit a condition called underidentification, where you don't have enough measured variables — indicators — for all the factors to estimate all the parameters. Because basically, in confirmatory factor analysis, what you're doing is equating covariance matrices — we're not doing correlation matrices anymore, we're in covariance matrices. The model you hypothesize creates a theoretical covariance matrix, and you equate that to the actual covariance matrix from the data; that's how you estimate all your parameters. So if you don't have enough indicators for each of the factors, you may have created a situation where it can't estimate them all, and you won't be able to get a scale.
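The arithmetic behind underidentification is a simple count you can do before collecting data. A sketch under one common convention (scale set by fixing one loading per factor, factor variances and covariances free, simple structure — other conventions shift the counts slightly):

```python
def cfa_df(indicators_per_factor):
    """Degrees of freedom for a simple-structure CFA.

    Compares the p(p+1)/2 distinct (co)variances in the data against
    the free parameters: loadings (one per indicator, minus one fixed
    per factor to set its scale), uniquenesses, and factor
    (co)variances. Negative df means the model is underidentified.
    """
    m = len(indicators_per_factor)          # number of factors
    p = sum(indicators_per_factor)          # number of indicators
    known = p * (p + 1) // 2                # distinct (co)variances
    free = (p - m) + p + m * (m + 1) // 2   # loadings + uniquenesses + factor cov
    return known - free
```

Two indicators on a lone factor gives negative degrees of freedom — the two-variable trap just mentioned — while three indicators is just-identified, and a second correlated factor rescues the two-indicator case.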
Assuming you have all that, it's easier to build the model in Amos, because it has a GUI where you can actually draw out your model, put in all your indicators, and easily attach the data from your data file to make sure you've got your model specified correctly. And then it goes through all the different — now we're into a statistical model, as opposed to a descriptive model. When you get into confirmatory factor analysis, there are statistical tests. There are also a lot of rules of thumb, because normally the statistical test isn't satisfied — that's another discussion — but you still get a number of tests on parameters and things like that that you don't get in exploratory factor analysis. And what you might do is take the result of an exploratory factor analysis and move it to confirmatory. In exploratory factor analysis, every individual variable loads on every factor. When you move to confirmatory factor analysis, you are definitely making a hypothesis: I only need to keep some smaller group of those variables on the first factor, and that's going to measure the first factor well enough for me; I can put a different set on the second factor and third factor; I can let them all be correlated or whatever; and that will all be supported by the data. So in confirmatory factor analysis you're really doing a test to say: does the data support a more definitive model than what you're generating in a pure exploratory factor analysis? Okay. These will go up eventually on [inaudible] — whether there's an educational copy of the presentation — and they'll be freely available.