Factor analysis is a very useful tool for validating measurement. The idea is that it takes in multiple indicators and answers the question: what do these indicators have in common? It tries to extract or identify underlying dimensions from your data. The reason we use factor analysis for measurement is that before we apply any reliability statistics, we have to study whether the indicators are unidimensional. If they are, we use a unidimensional reliability index. If not, we calculate the reliability statistic based on the factor analysis. Factor analysis can also be used to assess the hypothesis that the indicators are consequences of a common cause, and in that way we can try to use factor analysis to justify causal claims where we say that the construct causes multiple items.
There are two main variants of factor analysis: exploratory factor analysis and confirmatory factor analysis. In exploratory factor analysis you give the computer your data set and ask it to give you three factors, two factors, or however many factors you want to have from the data, and the computer will identify the factors. In confirmatory factor analysis you specify the factor structure yourself. You say, for example, that the first three indicators measure one thing, which is one factor, the second three measure another thing, which is a second factor, and the remaining four indicators measure a third thing, which is the third factor. Then the computer will estimate the model for you and tell you whether that model is plausible for the data.
Exploratory factor analysis is easier to apply because you don't have to specify the structure yourself.
You just specify the number of factors and which variables you use. For that reason many people get started with exploratory factor analysis, and if you do data exploration or some initial analysis, exploratory factor analysis is quicker to do. So exploratory analysis is typically covered first, followed by confirmatory factor analysis.
I will now demonstrate factor analysis using the exploratory approach, and to do that we need some data. Our data are from the Olympic decathlon, so we have the 10 sports that the athletes do: 100 meter run, long jump, shot put, high jump, 400 meter run, 110 meter hurdles, discus throw, pole vault, javelin throw, and 1500 meter run. So there are 10 different sports in this competition, you are scored on your performance in each, and the overall ranking is determined by the scores. You have to be a very good overall athlete to be able to do decathlon.
The data look like this. These are the first 15 observations. We have the 100 meters in seconds, long jump in meters, shot put in meters, high jump in meters, the 400 meter run in seconds, the 110 meter hurdles in seconds, discus throw in meters, pole vault in meters, javelin throw in meters, and the 1500 meter run in seconds.
So what kind of dimensions do these data have? That's what factor analysis will tell us. We'll first do a factor analysis and request two factors, just to get started with something. So that's the two-factor solution. Before I explain the factors, it's important to understand what these numbers tell us, and let's start with uniqueness and communality. Uniqueness and communality sum to 100%, or 1. Communality tells how much of the variation of a particular indicator the two factors explain.
So for example, for shot put, the factors explain 94.5% of the variation and only 0.5% is unexplained. The uniqueness is how much of the indicator remains unexplained by the factors. Ideally, if the factor model is correctly specified, so that the factors perfectly match your theoretical constructs and the indicators have no systematic measurement error, then the uniqueness quantifies the amount of random noise in the indicators. That's the ideal case; whether it applies in any real case is another question. So the communality is a kind of estimate of reliability, and the uniqueness is an estimate of unreliability. That's one way to look at it.
Then we have two factors, MR1 and MR2. The MR simply comes from the fact that we estimated with the minimum residual technique; you don't have to care about what that means. So we have a first factor and a second factor, and these numbers are called factor loadings, and they are in correlation metric here. The idea is that the first indicator correlates at minus 0.71 with the first factor and minus 0.22 with the second factor. So the first variable is very strongly associated with the first factor and a bit more weakly associated with the second factor.
Let's take a look at the first factor now. Some of the indicators have negative loadings on the first factor, and we have to understand why that is the case. If we look at the items that have negative loadings, we have the 100 meter run, the 400 meter run, the 110 meter hurdles, and the 1500 meter run. All of these are running sports, and what they have in common is that more time means you are worse and less time means you are better. With all the others you are throwing something or you are jumping, and there more is better. So in the running sports less time is better, and in the others more distance or more height is better.
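The link between loadings, communality, and uniqueness described above can be sketched in a few lines. This is only an illustration under two assumptions that the lecture does not spell out at this point: the indicators are standardized and the factors are uncorrelated, so the communality is the sum of the squared loadings. The function name is made up for this sketch; the two loadings are the ones quoted above for the first indicator.

```python
import numpy as np

def communality_and_uniqueness(loadings):
    """Return (communality, uniqueness) for one indicator or for each row.

    Assumes standardized indicators and uncorrelated factors, so that the
    communality is the sum of squared loadings and uniqueness = 1 - communality.
    """
    loadings = np.asarray(loadings, dtype=float)
    communality = np.sum(loadings**2, axis=-1)
    return communality, 1.0 - communality

# Loadings of the first indicator on the two unrotated factors (from the lecture)
h2, u2 = communality_and_uniqueness([-0.71, -0.22])
print(f"communality: {h2:.2f}, uniqueness: {u2:.2f}")
```

Because the loadings are squared, their signs do not matter for communality: an indicator with negative loadings is explained just as well as one with positive loadings of the same size.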
To make the results a bit more understandable, I will now reverse score the running times so that for every variable a higher value indicates that the athlete performs better. So I reverse the signs of all the running sports, and then we get this factor analysis result. We can see that every indicator loads positively on the first factor, and the magnitudes of the loadings differ.
So how would we interpret the first factor? All indicators are positively associated with something. What is the underlying dimension that influences these indicators, according to these results? If everything correlates positively with the first factor, then the first factor basically is how good the guy is, that is, how good an athlete the person is. If you are a good athlete, then you perform better in all of these sports. Good athletes are expected to perform better than bad athletes, and therefore all the items are positively correlated.
On the second factor, we can see that shot put, javelin, and discus are positively associated. The 1500 meter run is negatively associated, as are all the other running sports. So the second factor quantifies whether the person is better at sports that require strength or at sports that require running speed. There is a trade-off: if you are a very bulky guy, you are good in the strength sports, but you have more mass, and therefore you are not that great in the running sports. The second factor quantifies that trade-off.
So we have a factor for how good the guy is, and we have a factor for whether the guy is better at running or at strength sports. We would ideally like to think that there are two dimensions in these data: how good the guy is at running, and how good the guy is at the sports that require strength. But this factor analysis solution doesn't answer that question.
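The reverse scoring step used above can be illustrated with a small sketch. The numbers here are simulated for illustration only, they are not the decathlon data, and the variable names are made up. The point is just that negating a timed event flips the sign of its correlations, and therefore of its loadings, without changing their size.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated data (NOT the decathlon data): one general ability drives both events.
ability = rng.normal(size=n)
run_100m = 11.0 - 0.3 * ability + rng.normal(scale=0.2, size=n)   # seconds, lower is better
long_jump = 7.0 + 0.3 * ability + rng.normal(scale=0.2, size=n)   # meters, higher is better

def reverse_score(times):
    """Negate a timed event so that a higher value means better performance."""
    return -np.asarray(times, dtype=float)

r_before = np.corrcoef(run_100m, long_jump)[0, 1]
r_after = np.corrcoef(reverse_score(run_100m), long_jump)[0, 1]
print(f"before reverse scoring: {r_before:.2f}, after: {r_after:.2f}")
```

Reverse scoring only changes the direction of the variable, so the factor solution is the same apart from the signs of the loadings of the reversed items.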
To find out whether the data really have a running dimension and a strength dimension, we do something called factor rotation. Factor rotation is a technique that re-orients the factor solution so that it's simpler to interpret. Typically, when you apply factor analysis and you have two correlated dimensions, the first factor will capture a little bit of both dimensions, like here, where running speed and strength are both captured by the factor for how good the guy is. And the second factor will capture whether the guy is better at running or better at strength sports. When we re-orient the solution using factor rotation, the factors will typically correspond better to the actual dimensions in the data.
So here, after rotation, we have the first factor strongly associated with all the running sports. We have 0.84 here, 0.7, 0.6, and so on. And the second factor is strongly associated with sports that require strength, like the discus and the shot put. We can see that even better by re-ordering the indicators. We re-order based on the first factor, and we can see that the running sports have the five largest loadings. Then we have the pole vault, and then the strength sports: the shot put, javelin, and discus throw.
The first factor now clearly has an interpretation. It is related to running, so that's running skill, or how good a runner you are. The second factor also has a clear interpretation. It's related to the strength sports, and it's upper body strength. The pole vault requires both, so it loads on both. This is called a cross-loading, because the indicator loads on two factors. First you have to run, then you put the pole into the hole, and then you have to use your upper body to use the pole and get as high as possible. So pole vault requires both skills.
We can also see that high jump has a high uniqueness. It's not really related to upper body strength at all, and it's not really related to running speed, because you don't have to run fast. You just run to pace yourself and then you jump up.
So jumping up is different from running fast. In long jump, the better you are at running, the faster you can get yourself going and the further you will fly when you jump. So that requires running. In this way we can give meaning to these factors.
So that was a two-factor solution. We can of course get more than two factors. There is quite a lot of unexplained variation here: for high jump, 90% of the variation is unexplained by these two factors. So we can try extracting more factors. Whether it makes sense to do so is more about your theoretical expectation and whether you can actually interpret the factors than a statistical question of whether we can explain more of the variance in the indicators. There are statistical techniques for deciding the number of factors, but it is mostly a theoretical concern, and it's about whether you can still interpret the result.
Let's try three factors and see what happens. This is the rotated solution, and I have again ordered the variables according to the first factor loading and then the second factor loading. So we have three factors now. The first factor is the same running speed. The second factor is the same upper body strength, so we have the strength sports here. And then we have a third factor that has the 1500 meter run, the 400 meter run, and the long jump, and not much else. So it's not about running speed as much as it's about running stamina. It's slightly different. The first factor is whether you're good at running short distances, which is explosive running speed, how fast you accelerate, things like that. And the third is whether you can keep up the running. The upper body strength is the same. So we can divide running further into two sub-dimensions. Whether it makes sense to do so is another question. In this case probably not. It's probably better to just say that some people are better at strength sports and some people are better at running sports.
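To make the rotation step used for the solutions above more concrete, here is a minimal sketch. The lecture does not say which rotation was used, so this uses varimax, a common orthogonal rotation, purely as an illustration. The loadings are hypothetical: three "running" items and three "strength" items with a clean structure, deliberately mixed so that the unrotated solution looks like a general factor plus a contrast, similar to what we saw before rotation.

```python
import numpy as np

def varimax(loadings, max_iter=500, tol=1e-8):
    """Orthogonal varimax rotation using the common SVD-based iteration."""
    L = np.asarray(loadings, dtype=float)
    n_rows, n_factors = L.shape
    rotation = np.eye(n_factors)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = L @ rotation
        # Gradient of the varimax criterion with respect to the rotated loadings
        gradient = rotated**3 - rotated * (np.sum(rotated**2, axis=0) / n_rows)
        u, s, vt = np.linalg.svd(L.T @ gradient)
        rotation = u @ vt
        previous, criterion = criterion, np.sum(s)
        if previous != 0 and criterion / previous < 1 + tol:
            break
    return L @ rotation, rotation

# Hypothetical simple structure: rows 0-2 are "running", rows 3-5 are "strength".
simple = np.array([[0.8, 0.0], [0.7, 0.0], [0.6, 0.0],
                   [0.0, 0.8], [0.0, 0.7], [0.0, 0.6]])

# Mix the two dimensions so the first factor is positive for every item and
# the second factor contrasts running against strength.
angle = np.deg2rad(30)
mix = np.array([[np.cos(angle), -np.sin(angle)],
                [np.sin(angle),  np.cos(angle)]])
unrotated = simple @ mix

rotated, rotation = varimax(unrotated)
print("unrotated:\n", np.round(unrotated, 2))
print("rotated:\n", np.round(rotated, 2))
```

The rotation only re-orients the factors. The communality of each indicator, the row sum of squared loadings, is the same before and after, so rotation changes the interpretation and not how much variance is explained.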
We can also get four factors, and we get the same factors: running speed, upper body strength, running stamina. And then the final factor is simply high jump. High jump receives its own factor, and nothing else loads on the high jump factor. When we start extracting factors, we can typically go on and get as many factors as we have indicators, and eventually we will get factors that just explain a single indicator and nothing more. The idea of a factor is to find underlying dimensions in the data. Once we start to get factors that just tell how good the guy is in high jump, it's not really a factor anymore in the sense of being an underlying dimension.
So with these data, three factors could be a good solution if we're really interested in the difference between running stamina and running speed. Or we could just take the two-factor solution, which measures the running skills and the strength of the athlete. So the choice of the number of factors is something you have to argue for: it depends on what your research question is and what kind of abstraction you want to have for your data. In practice, when we apply factor analysis to measurement scales, for example surveys, and we want to measure five different things with a survey, we set the number of factors to five, because we want to get five things from the data. Ideally the factor analysis then demonstrates that the indicators correspond to the theoretical constructs that they're supposed to measure.
Factor analysis is based on correlations, so it's useful to understand the relationship between the correlation matrix and factor analysis. The model-implied correlations should match the observed correlations; the same principle applies here as in a regression model, and I'll cover that a bit later. But here we can see that factor analysis groups the indicators based on their correlations. First we have the running speed factor, so all the running sports are highly correlated.
So they are reflections of one underlying running speed factor. Then we have the others. We have upper body strength, so the sports that require upper body strength are highly correlated. Then we have the running stamina factor. Some of the running sports require both endurance and speed, and the 1500 meter run requires endurance more than speed. And then we have high jump, which does not load on any of these factors because it is nearly uncorrelated with any other sport. High jump is a unique sport in that it doesn't really require strength and it doesn't require speed. It requires the capability to just jump very high.
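The connection between loadings and correlations can be sketched as follows. This is an illustration under the assumption of standardized indicators and uncorrelated factors, in which case the correlation that the model implies between two indicators is the sum of the products of their loadings. The loadings and the function name are hypothetical, not taken from the decathlon results.

```python
import numpy as np

def implied_correlations(loadings):
    """Model-implied correlation matrix for uncorrelated factors.

    Off-diagonal entries are sums of products of loadings; the uniqueness
    of each indicator fills the diagonal up to 1.
    """
    L = np.asarray(loadings, dtype=float)
    common = L @ L.T
    uniqueness = 1.0 - np.diag(common)
    return common + np.diag(uniqueness)

# Hypothetical loadings: two "running" items on factor 1, two "strength" items on factor 2
loadings = np.array([[0.8, 0.0],
                     [0.7, 0.0],
                     [0.0, 0.9],
                     [0.0, 0.6]])
R = implied_correlations(loadings)
print(np.round(R, 2))
```

Indicators that load on the same factor get a high implied correlation, and indicators that share no factor get an implied correlation of zero, which is why the factors end up grouping highly correlated sports together.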