Another important procedure in SPSS, when you're analyzing data, is something called factor analysis. Now, I like to think of it as looking at your data and trying to find shadows. In this picture, what you have are shadows; those are the black figures that you see. It takes a moment to figure out that you're looking down, and there actually are people, kind of sticking straight out. And so in this photo, you're going from a three-dimensional original, the person itself, to a two-dimensional version, the shadow. What's interesting about that is you maintain most of the useful information. You can tell that they're people, that they're walking; you can probably even tell some things about how tall they are, what they're wearing, and so on. What you've done is you've made things more efficient. Now, in the data world, that's called dimensionality reduction, where each variable is a dimension. And too many variables can actually be really problematic. You're trying to boil things down a little bit. And you can think about the saying "less is more." More specifically, less noise and fewer unhelpful variables in your data set equal more meaning, because that's what you're trying to do: you're trying to extract meaning. Now, when it comes to factor analysis and related techniques, I have one very important piece of advice, and that is to be practical. At all points, you want to remember: what is your goal? Well, I'll tell you what the goal of factor analysis is not. It's not an exercise in analytical purity. You're not there to show that you know how to go through all the steps in the approved format. Really, you're working with your data because you're trying to get some understanding. So the goal of a procedure like factor analysis is useful insight. Try to follow the rules, and do what you can to make sure you don't make any obvious mistakes.
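If you want to see the shadow idea outside SPSS, here's a minimal Python sketch of dimensionality reduction. It uses made-up data, not the course data set: 200 three-dimensional observations that really vary along only two underlying directions get projected down to two dimensions, and we check how much of the original variance that two-dimensional "shadow" keeps.

```python
import numpy as np

# A minimal sketch with synthetic data: 200 "3-D" observations that mostly
# vary along two underlying directions, projected down to 2-D while
# keeping most of the variance.
rng = np.random.default_rng(42)
latent = rng.normal(size=(200, 2))             # two underlying dimensions
mix = np.array([[1.0, 0.0],
                [0.8, 0.6],
                [0.1, 1.0]])                   # map the 2-D structure into 3-D
X = latent @ mix.T + 0.05 * rng.normal(size=(200, 3))  # plus a little noise

# Principal components via eigendecomposition of the covariance matrix
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)         # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Share of the total variance kept by the first two components
kept = eigvals[:2].sum() / eigvals.sum()
shadow = Xc @ eigvecs[:, :2]                   # the 2-D "shadow" of the 3-D data
print(f"variance retained by 2 of 3 dimensions: {kept:.1%}")
```

Like the photo, the projection throws away a dimension but keeps nearly all of the useful information.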
But remember, you're not bound by the mathematics; you're bound by what the data tells you about the people. Another way of looking at that is to use factor analysis, or really any other procedure, for its heuristic value. That is, it suggests possibilities to you as you analyze the data, as you're trying to get insight into people. Now, that's sort of a philosophical digression. Let me show you how this actually works in SPSS. You're going to need to download from the course files the folder that says "data" at the end, and from it the cars.sav data set. This is the one that we used in hierarchical clustering as well. And then you want to open up the SPSS syntax file that goes with this particular section. Now, the easiest way to open the data set is simply to double-click on it and you'll be ready to go. I do have some syntax you can use if you saved it to your desktop. I've got it open already. So let's take a quick look at the data set. We have a collection of cars listed down the side and attributes like miles per gallon, gears, the transmission, carburetors, and so on. That's great. Now, I do have to make a very important confession here: this is a very, very small data set for factor analysis. It only has nine variables other than the identifier, and it only has 32 cases. Really, you would want to have at least several hundred cases, and let's say several dozen variables, before you can do this really reliably. But this example works, and it actually makes it really easy to see what's happening and how to interpret the results. The first thing we're going to do, if you look at the syntax, is a default factor analysis, and that's actually a misnomer, because it's not a factor analysis, it's a principal components analysis, but it lives in the factor analysis command within SPSS. So let's come up here to Analyze and down to Dimension Reduction. Remember, I said that's what this is called. We'll pick Factor; it's our only choice there.
And what we need to do is choose the variables that we're going to use, to see what we can compress and what goes into what. We don't need the name of the car; that's just an identifier. We can take the rest of these, however, and put them under Variables. Now, we've got a lot of options here. I'm not going to use any of them; I'm just going to hit OK for right now. I'll make the output window bigger, and here's what we get from the default analysis. We get a text output of the commands that were generated by the drop-down menus. We get something called communalities. Each variable brings with it one unit of standardized variance. That's based on how spread out the scores are: if you standardize them, then each variable has a variance and a standard deviation of one. And the extraction column tells us how much of that variance is actually reproduced by the procedure we're running. An important one right here is the total variance explained, because what this has done is create components. Remember, I said this is actually a principal components analysis. While principal components analysis has profoundly different philosophical underpinnings from factor analysis, the difference has to do with which comes first, the factors or the observed variables, and truthfully, most people treat them as relatively interchangeable. And if you're using them for heuristic value, it's not going to make a big difference. But what we have here are two components. We have one with 5.472 units of variance; that's about 61% of the original variance of the nine variables. And then another one with 2.341 units; I'm getting those numbers from right here. And you can see it held on to these two, which collectively add up to about 87% of the variance. Now, the component matrix shows the relationship between the original variables and the two components; these are like correlation coefficients.
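To make those numbers concrete, here's a small Python sketch, using a hypothetical four-variable correlation matrix rather than the course data, of what SPSS is reporting: the eigenvalues of the correlation matrix are the "units of variance" per component, the loadings are eigenvectors scaled by the square roots of the eigenvalues, and the communalities are how much of each variable's variance the retained components reproduce.

```python
import numpy as np

# Hypothetical correlation matrix for four standardized variables
# (not the course data; three variables correlate, one stands apart).
R = np.array([
    [ 1.0,  0.8, -0.7, 0.1],
    [ 0.8,  1.0, -0.6, 0.2],
    [-0.7, -0.6,  1.0, 0.0],
    [ 0.1,  0.2,  0.0, 1.0],
])

eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Each variable contributes one unit of standardized variance,
# so the eigenvalues add up to the number of variables.
print("eigenvalues:", np.round(eigvals, 3))
print("proportion of variance:", np.round(eigvals / R.shape[0], 3))

# Loadings: correlations between the variables and the components.
loadings = eigvecs * np.sqrt(eigvals)

# Communalities for a two-component solution: row sums of squared
# loadings over the components that were kept.
communalities = (loadings[:, :2] ** 2).sum(axis=1)
print("communalities:", np.round(communalities, 3))
```

That's the same arithmetic behind the 5.472 and 2.341 in the SPSS table: divide each eigenvalue by nine variables and you get the roughly 61% and 26% that together make about 87%.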
You can see that miles per gallon is strongly negatively associated with the first component and really not associated with the second. But the number of carburetors has a pretty strong association with each. And so that's a way to start to look at it. But it's going to be a lot easier if we make certain modifications. In fact, I'm going to just delete this output right here, and we're going to start over; I'm going to make a few changes. Let's go through each of these options. First we go to Descriptives, and I don't really feel like I need the initial solution, so I'm going to unselect that and hit Continue. Then Extraction. This is the actual algorithm that SPSS uses to work through the relationships in the multidimensional space. You'll see right here, it's Principal Components; that's why I said this is really a principal components analysis. You've got a lot of options here. Now, in many situations, Maximum Likelihood would be a very good answer. I'm going to choose Principal Axis Factoring, simply because it's the classical version of factor analysis. I don't need to see the unrotated factor solution, but I do want to see something called a scree plot. That is a graph that suggests how many factors I should keep. I'm going to come down here and change the maximum iterations for convergence, which has to do with the math that's done; I'm going to change it to 50. Then I'm going to come to Rotation. What you get here is a multidimensional space, and sometimes it's a little easier if you rotate the axes; it can increase interpretability. Now, there are a lot of different methods. Varimax is a method that maintains orthogonal relationships; it keeps all of your axes perpendicular to each other. There are situations where that's really good. But truthfully, for exploratory purposes, which is what we're doing, I like to use what's called an oblique rotation. That allows your factors to be correlated with each other.
They don't have to be totally perpendicular. I'm going to use Direct Oblimin. Promax is another really good choice, but it's usually for larger data sets, and I've got a tiny one here. Now, here I can get a rotated solution. I don't think I really need that, but I do want to see the loading plot. And I'm going to increase the maximum number of iterations to 50 and hit Continue. We'll come down to Scores. You can save the factor loadings as scores, and there might be situations where you want to do that. But because I'm using factor analysis for its heuristic value, as a way of suggesting what variables go with others, I'm actually not going to do that, so I'm going to hit Cancel. And then finally, Options. This is where you specify how to exclude cases. I have a complete data set, so I don't need to worry about that. But look at the coefficient display format: I'm going to sort the coefficients, and then I'm actually going to have it completely suppress small coefficients. Now, I've done this one before, so I happen to know that a cutoff of 0.6 is, under normal circumstances, really high. But given my very small data set, this seems like a reasonable choice, and it makes the solution very, very clear when we look at it. So I'm going to hit Continue, and then I'm going to hit OK. I've got my output here, and the first part's pretty similar, except it doesn't start with unit variance for each of these. That's because I'm not doing principal components anymore; I'm doing principal axis factoring, and so the math behind it is a little bit different, but we don't need to dwell on that. Total variance explained: you see that we still have two factors. The first one accounts for a lot of the variance, and the second one accounts for a fair amount also, and these are very close to what we had with the principal components. The scree plot is a very simple line plot that suggests how many factors we might want to keep.
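To give a feel for what rotation actually does to the loadings, here's a Python sketch of varimax, the orthogonal method mentioned above. I'm using varimax rather than the oblimin chosen in SPSS because it's the simpler algorithm to show; the loading matrix is hypothetical, not from the course data. The key property illustrated: an orthogonal rotation redistributes variance between the factors without changing any variable's communality.

```python
import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """A sketch of varimax rotation for a factor loading matrix."""
    p, k = loadings.shape
    R = np.eye(k)                      # start from no rotation
    d = 0.0
    for _ in range(max_iter):
        Lr = loadings @ R
        # Standard varimax update, expressed as an SVD
        u, s, vt = np.linalg.svd(
            loadings.T @ (Lr**3 - (gamma / p) * Lr @ np.diag((Lr**2).sum(axis=0)))
        )
        R = u @ vt
        d_new = s.sum()
        if d > 0 and d_new / d < 1 + tol:
            break
        d = d_new
    return loadings @ R

# Hypothetical unrotated loadings: five variables on two factors
L = np.array([[0.80, 0.30],
              [0.85, 0.25],
              [0.35, 0.80],
              [0.30, 0.75],
              [0.70, 0.45]])
rotated = varimax(L)

# After rotation, each variable tends to load strongly on one factor
# and weakly on the other, which is what makes the solution readable.
print(np.round(rotated, 2))
```

Oblique rotations like Direct Oblimin go one step further by letting the rotated axes be non-perpendicular, which is why they produce separate pattern and structure matrices.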
Now, there are several different rules you can use for interpreting this. One is to keep anything that's above a value of one, because one unit of variance is what each variable brought with it; you want factors that can explain more than that. And you see we have two that do a lot more than one, while these others are sort of straggling down. The other rule is to look for a bend in the line, and you do see a strong bend right here. So three is where the bend is; we're justified in staying with two. There are other methods that get more involved, checking the slope of this line and finding values that fall above it. You can do those at another time; this is a quick demonstration. Now, what we get next are three matrices: a factor matrix, a pattern matrix, and a structure matrix. They're all associated with each other, and I've got a little note here in the syntax that explains them. I'll come down here. The factor matrix is the association of each variable with each factor; it's analogous to a correlation coefficient. That's the one that we're going to be focusing on. The structure matrix tells us how much each variable is predicted by the factors, because the idea here is that factors come first and variables come second, using what are called the unique and common contributions. So a factor might contribute something on its own, compared to the other factors, or it might contribute together with them. And the pattern matrix is an indication of each factor's unique contribution to each variable's variance. Those can both be important in different situations; they can help you interpret things. But for right now, I'm just going to focus on the first one, the factor matrix. So let me go back to where I was. When we come up to the factor matrix, what you see here is that because I suppressed values with an absolute value of less than 0.6, we have this totally clear separation.
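Both retention rules, and the coefficient suppression just described, are easy to sketch in Python. The eigenvalues below are hypothetical stand-ins shaped like the ones in the scree plot, and the loading matrix is likewise made up for illustration.

```python
import numpy as np

# Hypothetical eigenvalues for nine variables, like a scree plot displays
eigenvalues = np.array([5.47, 2.34, 0.45, 0.30, 0.18, 0.12, 0.08, 0.04, 0.02])

# Rule 1 (Kaiser): keep factors whose eigenvalue exceeds 1, i.e. factors
# that explain more than the one unit of variance a single variable brings.
keep = int((eigenvalues > 1).sum())
print("factors retained by the eigenvalue-greater-than-one rule:", keep)

# Rule 2 (the bend): look at the drops between successive eigenvalues;
# the large drop after the second value is the elbow in the scree plot.
print("successive drops:", np.round(-np.diff(eigenvalues), 2))

# Suppressing small coefficients, like the 0.6 cutoff set in Options:
loadings = np.array([[ 0.91,  0.12],
                     [-0.88,  0.05],
                     [ 0.18,  0.84],
                     [-0.22, -0.79]])
masked = np.where(np.abs(loadings) >= 0.6, loadings, np.nan)
print(masked)  # only the strong loadings remain visible
```

With the small values blanked out, each variable visibly belongs to exactly one factor, which is the "totally clear separation" in the output.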
Factor one is strongly associated with the number of cylinders in the car: more cylinders, higher on factor one. And then displacement, very high; miles per gallon, negative but very strong; and then weight in tons, very strong, and horsepower. This is really the big factor: cars that are really big are going to score heavily on factor one. Factor two is composed of the number of gears (more gears), the quarter-mile time (it's negative here, so the less time it takes to get through the quarter mile, that is, the faster the car, the higher the score), automatic versus manual transmission (you have to know that zero is automatic and one is manual, so these are manual-transmission cars), and the number of carburetors. This is really the fast factor; that's where the sports cars are going to score. Factor one has the Cadillacs and the Lincolns, and factor two has the Ferraris and the Lotuses and so on. And that makes perfect sense; it's really easy to see why it would be that way. And then if you come down here, this plot is also really helpful. It's got the two factors, with factor one across the bottom. That's our big factor, and you can see that weight goes on that one, displacement goes on that one, and cylinders. And then we have the number of gears, and miles per gallon is obviously on the low end. Factor two is the fast factor: more carburetors, more horsepower, more cylinders, and lower quarter-mile times. And that makes a lot of sense. And so this lets us know that we could boil our data down to really just these two factors: sort of, how big is the car, and how fast is it? That can give us a much more concise image of our data and allows us to extract more meaning. And that is the overall purpose of a procedure like factor analysis or principal components in SPSS.