In this video, I will introduce you to the important concept that a linear model implies a correlation matrix. This is something that you will typically run into in more advanced texts, but I think it is a very useful principle to understand even in a first course on quantitative analysis. A linear model is any model where all the relationships are linear, for example a regression model, and a correlation matrix quantifies linear associations, two variables at a time, on a standardized metric.

So what does it mean that the linear model implies a correlation matrix? Let's take a look at this regression model in path diagram form. We have three independent variables, X1, X2, and X3, linked to the dependent variable Y, with regression coefficients on these regression paths. Then we have some variation U, the error term, that the model does not explain, and the Xs are allowed to be freely correlated; each correlation is shown by a two-headed curved arrow.

What this principle says is that the correlations between the X variables are what the data give us: we can just calculate the correlation between X1 and X2, and it is taken as it is. We say that these correlations are free. But the correlations involving Y depend on the model. The correlation between X1 and Y depends on the correlations between the Xs and on the model parameters, so it is implied by the model. In practice, we start from X1 and trace paths: we check all the different ways we can get from X1 to Y, and the sum over those paths gives us the correlation between X1 and Y.

This is an important concept, because if you understand it, it will allow you to understand certain properties of regression analysis at a much deeper level than you otherwise would, and it is also very useful when you think of factor analysis, structural equation models, and other more complicated models.

Let's do the tracing. The idea of the path analysis tracing rules is that we pick the two variables whose correlation we want, here X1 and Y, and check how many different ways we can get from one to the other. We can only travel down along the arrows, or travel up, then along at most one curved arrow, and then back down again. From X1 we can get to Y in three different ways: along the direct regression path; from X1 along the correlation to X2 and down to Y (we cannot continue through another correlation, because a path may include only one); and from X1 along the correlation to X3 and down to Y. Those are all three paths from X1 to Y.

This gives us the following equation: the correlation between X1 and Y is the direct path, plus the correlation with X2 times the direct path from X2, plus the correlation with X3 times the direct path from X3. In symbols, cor(X1, Y) = beta1 + r12 × beta2 + r13 × beta3.

What is the interpretation of this equation? The correlation between X1 and Y equals the direct effect plus spurious effects that arise because X1 is correlated with X2 and X3, which both have effects on Y. In other words, the observed correlation combines the relationship of interest with other causes, common causes of Y that correlate with X1. So that's the idea: we identify the paths, multiply everything along each path, and then take the sum over the paths.
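To make the tracing concrete, here is a minimal sketch in Python. The coefficient and correlation values are made up for illustration; they do not come from any real data.

```python
# Tracing rules: multiply everything along each path from X1 to Y, then sum.
beta1, beta2, beta3 = 0.4, 0.3, 0.2  # hypothetical standardized regression paths
r12, r13 = 0.5, 0.3                  # hypothetical correlations among the Xs

cor_x1_y = (beta1           # direct path: X1 -> Y
            + r12 * beta2   # X1 <-> X2 -> Y (one curved arrow, then down)
            + r13 * beta3)  # X1 <-> X3 -> Y
print(cor_x1_y)             # 0.4 + 0.15 + 0.06 = 0.61
```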
So here, for example, the path from X1 through X2 includes this correlation and this regression path; we multiply them to get the value of the path, and we sum all the paths to get the correlation. The importance of this rule will be made clear in a few slides.

That gives us the correlations, but we also need the variances of the variables, and those are implied by the model as well. We are working on the correlation metric, which means the variance of Y is one, but that one is also something the model implies. For the variance of Y, we have to think of how many different ways we can go from Y to somewhere and come back. We can go to the error term and turn back: that is the variance of the error term, times one and times one again, because we go back and forth. Then we can go from Y to X1 and back: the variance of X1 is one, because we are working with standardized data, so we have beta1, times one, times beta1 on the way back, that is, beta1 squared. The same for X2 and back, and for X3 and back. Then we can go from Y to X1, along the correlation to X2, and back to Y, which gives beta1 times the correlation times beta2; and we can take the same path in the opposite direction, through X2, the correlation, and back, which gives the same term again.

That produces the following math. We have the direct effects, beta1 squared plus beta2 squared plus beta3 squared, because going from Y to each X and back multiplies each beta by itself. Then we have the correlational paths between each pair of variables, each counted twice because we can travel it in both directions, which is why we multiply by two. Finally, we add the variance of the error term. In symbols, Var(Y) = beta1^2 + beta2^2 + beta3^2 + 2 × (beta1 × r12 × beta2 + beta1 × r13 × beta3 + beta2 × r23 × beta3) + Var(U), and in a correlation matrix this variance is always one.

We can use these rules to calculate the full correlation matrix between all the variables in our data. The variances of the X variables are all ones, because we are working with correlations, and the correlations between the Xs are given by the data. Then we have the equations above for the correlations between Y and X1, Y and X2, and Y and X3, and for the variance of Y, which is the covariance of the variable with itself. Note that the error variance appears here as a parameter, not as an actual value.

So why would this principle be useful? The reason is that if we know the correlation matrix from the data, we can also work backwards and calculate the regression estimates. We know the correlations in the data, and we can find out which set of regression coefficients beta1, beta2, and beta3, together with which variance of the error term U, would be compatible with this correlation matrix, that is, would produce it as the implied correlation matrix.

Let's do that. Heckman's paper gives us a correlation matrix of all the variables before the interaction terms were formed, so we can calculate that part of model 1 using the correlations. We get estimates that are very close to one another: this one is −.23, and this one is −.23, so they are mostly the same.
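As an aside, here is a minimal sketch of that backwards calculation in Python, assuming standardized data. The correlation matrix below is made up for illustration, not taken from Heckman's paper; the betas come from solving the normal equations, and the tracing rules then reproduce the correlations and the unit variance of Y.

```python
import numpy as np

# Hypothetical correlations among the Xs (free, taken from the data as-is).
Rxx = np.array([[1.0, 0.3, 0.2],
                [0.3, 1.0, 0.4],
                [0.2, 0.4, 1.0]])
rxy = np.array([0.5, 0.4, 0.3])  # hypothetical correlations of the Xs with Y

# Work backwards: the standardized betas compatible with the correlation matrix.
betas = np.linalg.solve(Rxx, rxy)

# Error variance: on standardized data, R-squared equals betas' rxy.
var_u = 1 - betas @ rxy

# Forward check with the tracing rules:
implied_rxy = Rxx @ betas                    # direct path plus spurious paths
implied_var_y = betas @ Rxx @ betas + var_u  # squared betas, doubled cross paths, Var(U)

print(betas, var_u)                # parameters recovered from the correlations
print(implied_rxy, implied_var_y)  # reproduces rxy and a variance of exactly 1
```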
Back to the comparison: there is some imprecision, because the estimates are reported at two-digit precision and the correlations are also at two-digit precision, so we have rounding errors. Also, their model includes interaction terms that ours does not, because they did not present the correlations between the interactions and the other variables. But the results are mostly the same.

There is one important question now. If we look at the p-values, or rather the significance stars, because they present stars instead of p-values, we tend to have fewer stars than they have in the paper. That is an important question whenever we replicate something: if we don't get the same result, why is that? Working out why the p-values from our replication differ from those in Heckman's paper is useful, because it teaches you something about statistical analysis.

Remember that the p-value is determined by the estimate, the standard error, and the reference distribution against which we compare the t-statistic, the ratio of the estimate to its standard error. Our estimates are about the same as theirs, so what could be different is the standard errors: somehow we calculate standard errors differently than they do. For example, because we don't include the interaction variables in the model, it is possible that our standard errors are larger, and larger standard errors would lead to the p-value differences. That is an unlikely explanation, but it is possible, so let's check whether it is plausible.

To do that, we have to consider where the standard errors come from in regression analysis. One way to calculate the standard errors is the equation shown here, and remember that we calculate the p-value by comparing the estimate divided by the standard error against the t-distribution. So could our standard errors be different? Are the values that go into this equation different from Heckman's?

The first thing we notice is the R-squared here. This R-squared in the formula, R-squared j, is the R-squared from regressing one independent variable on every other independent variable in the model. So to calculate the standard error for one variable, we calculate the R-squared of that variable on all the other independent variables; R-squared j tells us how much of one independent variable is unique compared to the others. This term has some additional meanings that I will explain in a later video.

So what happens if we omit variables? Heckman's study had 15 independent variables in the first model, because they had three interaction terms; we only have 12. We know that if we add variables to a model, R-squared can only increase, and conversely, if we remove variables, it can only decrease. So our R-squared j should be a bit smaller than Heckman's, because we have fewer variables in the model: we don't have the interactions. If R-squared j decreases, then 1 minus R-squared j increases, and that makes the denominator of the formula larger. A larger denominator means that, based on this consideration alone, our standard errors should be smaller. And if our standard errors are smaller, then our p-values should be smaller as well, because the estimate divided by the standard error gets larger, further from zero, when the standard error shrinks, and that means a smaller p-value.
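As a sketch of this denominator effect, assume standardized data, so that a textbook form of the slope standard error is SE(beta_j) = sqrt(Var(e) / ((1 − R_j²)(n − 1))); this is one common form, not necessarily the exact equation on the slide. The sample size and R-squared values below are hypothetical.

```python
import numpy as np

def se_beta_j(var_e, r2_j, n):
    # Textbook SE of a standardized slope: sqrt(Var(e) / ((1 - R_j^2)(n - 1))).
    return np.sqrt(var_e / ((1 - r2_j) * (n - 1)))

n = 200  # hypothetical sample size

# Dropping predictors lowers R_j^2, so 1 - R_j^2 grows, the denominator grows,
# and the standard error shrinks (holding the error variance fixed).
print(se_beta_j(var_e=0.75, r2_j=0.40, n=n))  # full model: higher R_j^2
print(se_beta_j(var_e=0.75, r2_j=0.30, n=n))  # fewer predictors: smaller SE
```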
Okay, so what happens at the top of the formula? The numerator is the variance of the error term, which in standardized results is 1 minus the model R-squared. In Heckman's model it is 0.75; in ours it is 0.78, so there is a difference of a few percentage points. Because we can expect the denominator to be a bit larger and the numerator to be a bit larger as well, we expect the standard errors to be roughly the same. So we cannot look at where the standard errors come from and find a clear reason why ours would be substantially larger than Heckman's. That makes it an unlikely explanation.

So why do the p-values differ, then? If we have the same estimates and no reason to believe that the standard errors differ substantially, what remains as a plausible explanation is that we are comparing the t-statistic, the estimate divided by the standard error, against a different distribution than Heckman did, and that produces the different p-values. In fact, if we divide our p-values by 2, we mostly get the same stars as Heckman. That is an interesting observation: our p-values appear to be twice as large as Heckman's. Why would that be? It is an indication that Heckman actually used one-tailed tests instead of two-tailed tests.

The difference is that in a one-tailed test you only look at one tail of the distribution, so you reach the same significance level with a smaller value of the test statistic. With a one-tailed test, a value of about 1.7 is required for the five percent significance level, whereas with a two-tailed test, because the two tail areas together must sum to five percent, we need a value of about 2 for the same problem. In effect, a one-tailed test takes the two-tailed p-value and halves it.

Because it is the convention to use two-tailed tests, doing one-tailed tests without reporting that you did so is basically the same as claiming that you did two-tailed tests, and that is a bit unethical. Generally, there are very few good reasons for one-tailed tests; I can't name any. For example, Abelson's book Statistics as Principled Argument explicitly says that using one-tailed tests instead of two-tailed tests is practically cheating.

What is interesting is that when Heckman's paper was under review (the full revision history of the paper has been published), the manuscript included a mention that they used one-tailed tests, and you can see that many papers do use one-tailed tests without really justifying the choice. The choice is unjustified, but authors nevertheless make it, presumably because it makes the p-values smaller and the results look better. To their credit, they did mention that the p-values are one-tailed, which is the right thing to do, but for some reason that part of the regression table footnote was eliminated from the published version.

So the rule of thumb: don't use one-tailed tests. There is really no good reason for them, and if you do use them, report it clearly. But you really shouldn't.
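To close, here is the one-tailed versus two-tailed comparison in numbers, using scipy; the degrees of freedom and the t-statistic are arbitrary illustration values.

```python
from scipy import stats

df = 30  # hypothetical residual degrees of freedom

# Critical values at the 5% significance level:
print(stats.t.ppf(0.95, df))   # one-tailed: about 1.70
print(stats.t.ppf(0.975, df))  # two-tailed: about 2.04

# For a positive estimate, the one-tailed p-value is half the two-tailed one.
t_stat = 1.8
p_two = 2 * stats.t.sf(t_stat, df)  # two-tailed p-value
p_one = stats.t.sf(t_stat, df)      # one-tailed p-value
print(p_two, p_one)                 # p_one is exactly p_two / 2
```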