In this video I will show you one possible workflow for regression analysis. This workflow addresses all the assumptions that are empirically testable after a regression analysis. There are of course multiple different ways of testing the assumptions, but this is the way I like to do it. I'm using R for this example, but almost all of these tests and diagnostics can be done with Stata as well, and most of them can be done with SPSS. A regression analysis workflow, like any other statistical analysis workflow, starts by stating a hypothesis that we want to test. Then we collect some data for testing the hypothesis. After that we explore our data, because it is important to understand the relationships. Then we estimate the first regression model with the independent variables and the dependent variable. Then we check the results briefly to see what they look like, and we proceed with diagnostics. The diagnostics include various plots, and I prefer plots over statistical tests. The reason is that while you can, for example, run a test for heteroscedasticity, that test will only tell you whether there is a problem or not. It will not tell you the nature of the problem. It is much more informative to look at the actual distribution of the residuals to see what the heteroscedasticity problem is like. And if you just eyeball these graphs, you will basically identify the same thing that the test tells you. So I don't generally use tests unless someone asks me to do so. Then, when I have done the diagnostics, I figure out what the biggest problem is. For example, I may identify some nonlinear relationships that I didn't think of in advance, or I may identify some outliers, or I may identify some heteroscedasticity. Once I have fixed the biggest problem, I go back and fit another regression model where the problem is addressed, and then I do the diagnostics again.
And once I'm happy, I conclude that that is my final model after the diagnostics. I then possibly run nested model tests against alternative models, and then comes the fun part: I interpret what the regression coefficients mean. So I don't just state that a regression coefficient is 0.02; I tell what it means in my particular research context. And that is the hard part of regression analysis. To demonstrate regression diagnostics we need some data. We are going to be using the Prestige dataset again, and our dependent variable this time is prestige. We are going to use education, income, and share of women as independent variables. So that is our regression model, and the regression estimates are here. We have gone through these estimates before in a previous video, so I will not explain them in detail. Instead I'm going to focus now on checking the assumptions. So how do we know that the six regression assumptions actually hold? The assumptions are shown here. The first assumption is that all relationships are linear, so it's a linear model. The second is that observations are independent. Independence of observations comes from our research design, and in a cross-sectional study it is difficult to test; if you have a longitudinal study, then you can do some checks for independence of observations. The third is no perfect collinearity and non-zero variances of the independent variables. Perfect collinearity happens if two or more variables perfectly determine one another. So if you have a categorical variable with three categories, then including three dummies leads to this problem, because once you know two of the dummies you know the value of the third. As for non-zero variance: if, for example, you are studying the effects of gender and you have no women in the sample, then you have no variance in gender. So that is another way this assumption can be violated. We know that this is not a problem in our data.
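Fitting the model described above takes one line in R. The video's data is the Prestige dataset from the car package; as a self-contained sketch that runs with base R alone, here is the same step on the built-in mtcars data (the mtcars variables are stand-ins, not the video's actual variables):

```r
# The video's model would be:
#   lm(prestige ~ education + income + women, data = car::Prestige)
# Self-contained stand-in using a base-R dataset:
fit <- lm(mpg ~ hp + wt, data = mtcars)  # dependent variable ~ independent variables
summary(fit)                             # coefficients, standard errors, R-squared
```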
Because if it was a problem, we couldn't even estimate the regression model. The fact that we got regression estimates indicates that we don't have a problem with the third assumption. The other assumptions are a bit more problematic, because they are about the error term, and we can't observe the error term. The fourth assumption was that the error term has an expected value of zero given any values of the independent variables. The fifth is that the error term has equal variance; this is the homoscedasticity assumption. And the sixth is that the error term is normally distributed. The way we test these three assumptions about the error term is that we use the residuals as estimates of the error term. If an observation is far from the regression line in the population, so it has a large value of the error term, then we can expect that it also has a large residual. So we can use the residuals as estimates of the error terms, and doing regression diagnostics normally means analyzing the residuals. That's quite natural: the residual is the part of the data that the model doesn't explain, and the idea of diagnostics is to check whether the model explains the data adequately, so it makes sense to look at the part of the data that the model doesn't explain for clues on what could go wrong. I normally start with the normal Q-Q plot of the residuals. The normal Q-Q plot assesses whether the residuals are normally distributed. It compares the residuals, which here are calculated based on standardized residuals, against the quantiles of a normal distribution. There are different kinds of residuals, but for an applied researcher it doesn't really matter to know them all; what's important is that your software will calculate the right kind of residual for you automatically when you do these plots. We can see here that the residuals roughly correspond to the normal distribution.
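The normal Q-Q plot of the residuals takes a couple of lines in base R; a minimal sketch, again using mtcars as a stand-in for the video's data:

```r
fit <- lm(mpg ~ hp + wt, data = mtcars)
r <- rstandard(fit)  # standardized residuals, the kind used in this plot
qqnorm(r)            # residual quantiles against normal quantiles
qqline(r)            # reference line; points near it suggest normality
# plot(fit, which = 2) draws the same diagnostic, with possible outliers labeled
```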
So a straight line here indicates that the residuals are normally distributed. Here is a problem: we have a chi-square distributed error term, so the residuals are further from the mean than they are supposed to be. And here we have a uniformly distributed error term, which creates this kind of S-shape in the normal Q-Q plot. While normality of the error term is not an important assumption in regression analysis, I nevertheless do this plot because it's quick to do, it identifies outliers for me, and it gives me a kind of first look at the data. Here, with the actual data, I can see that the residuals follow a normal distribution, so I'm happy with this. This is an indication of a well-fitting model with respect to the sixth assumption. R labels these possible outliers. Newsboys has a large negative residual, so newsboys are less prestigious than what the model predicts. And farmers are more prestigious than what the model predicts: farmers don't make much money and you don't need high education to be a farmer, but farmers are still appreciated a lot. So that's the other extreme case. The normal Q-Q plot shows that the residuals are roughly normally distributed, and that's a good thing, so we conclude there are no problems. Then we start looking at more complicated plots. The next plot is the residual versus fitted plot, and the idea is that it allows us to check for nonlinearities and heteroscedasticity in the data. The fitted value is calculated based on the regression equation: we multiply the variables by the regression coefficients, and then we plot the residuals against the fitted values. Ideally, there is no pattern here; the residuals and fitted values are just spread out. So this is an indication of a well-fitting model in this regard. Here, in contrast, we have a heteroscedasticity problem.
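The residual versus fitted plot can be drawn by hand or with R's built-in plotting method for lm objects; a sketch under the same mtcars stand-in assumption:

```r
fit <- lm(mpg ~ hp + wt, data = mtcars)
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)  # ideally the points scatter around this line with no pattern
# plot(fit, which = 1) is the built-in version, with a loess smooth added
```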
In that plot, the variation of the residuals, and therefore also the variation of the error term, is a lot less here in the middle, and then it opens up to the left and to the right. This is a butterfly shape of residuals, and it is the worst kind of heteroscedasticity problem that you could have. But it's not very realistic, because it's difficult to think what kind of process would generate this kind of data. Then here we have nonlinearity and some heteroscedasticity: this is a megaphone opening to the right, and it appears that there is a slight nonlinearity as well. And here we have severe nonlinearity, so the right shape is not a line but a curve. And this is a weird-looking dataset that has both a nonlinearity problem and a heteroscedasticity problem. Ideally we want a plot that looks like this one, with no particular pattern. So typically, in these diagnostic plots that plot the residual against something else, we are looking for no pattern. Our residual versus fitted plot looks like this. The observations with high residuals in absolute value are marked again, and we can see from the fitted values that there are very few professions for which the model predicts high prestigiousness; most observations are between 30 and 70. What can we infer from this plot? We can infer that maybe the variance of the residuals decreases slightly to the right. We don't have many observations there, so we don't know if the dispersion is actually the same, because we observe only a couple of values in that region. But it is possible that the dispersion here is this much, and the dispersion there is slightly less. So it is possible that we have a heteroscedasticity problem and the fifth assumption does not hold. Whether that is severe enough to warrant using heteroscedasticity-robust standard errors is a bit unclear, because this is not a clear case where we should use those. Then we check for outliers.
So far we have been looking for evidence of heteroscedasticity and nonlinearity; we have found evidence of heteroscedasticity but not really of nonlinearity. Then we look for outliers as the final step, using the fourth plot, the residual versus leverage plot, which tells us which observations are influential. We are looking for observations that have both high leverage and a high residual. General managers have high leverage and a high residual in absolute value. So we want to look for observations with residuals that are large in absolute magnitude. Stata, for example, uses the squared residual here; because the square always increases with the magnitude of the residual, it is easier to see which observations have large residuals. In this plot we have to look for small negative values or large positive values, so it's not as simple as if this were the squared residual. Minister has high leverage, newsboys has a large residual, and general managers are here. Cook's distance is another measure of influence, and observations with a large Cook's distance are potential outliers. As in the Deephouse paper, to deal with these outliers we would ask why the prestigiousness of one occupation would be different from the others. For example, general managers earn a lot of money, so their predicted prestigiousness is high, because the prediction depends on income. But they are less prestigious than what the model predicts, which means that the model over-predicts their prestigiousness because of their high income. So that could be one reason to drop general managers. But you have to use your own judgment, because this is 102 observations, so dropping one observation decreases our sample size by about one percent, and that has a cost. Conceptually, the leverage is the distance from the mass center of the data, and Cook's distance summarizes how much an observation influences the fit. So we identify outliers using this plot.
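Leverage and Cook's distance are both available as extractor functions in base R; a sketch with the mtcars stand-in (which observations come out as influential of course depends on the data):

```r
fit <- lm(mpg ~ hp + wt, data = mtcars)
h <- hatvalues(fit)       # leverage: distance from the mass center of the predictors
d <- cooks.distance(fit)  # influence of each observation on the fitted coefficients
plot(fit, which = 5)      # residuals vs leverage, with Cook's distance contours
head(sort(d, decreasing = TRUE), 3)  # the most influential observations
```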
Then we look at the final plot, which is the added variable plot. The added variable plot shows the relationship between the dependent variable and one independent variable at a time. And this plot is interesting. It plots education, the focal independent variable, regressed on the other independent variables, and takes the residual. So this is the part of education that is not explained by income or share of women. If you think about the Venn diagram presentation of regression analysis, this is the part of education that does not overlap with any of the other independent variables. Then we regress prestige on the other independent variables and take the residual, so we have what is unique to prestige and what is unique to education after partialling out the influence of all other variables in the model. Then we draw a line through that data, and this is actually the regression line of prestige on education. So one way to calculate a regression coefficient is to regress both the independent and the dependent variable on all the other independent variables and then run a regression using just those residuals; it produces the exact same result as including education with all the other variables directly in a multiple regression analysis. This plot allows us to look for nonlinearities and heteroscedasticity in a more refined manner. What we can identify here is that the effect of income looks pretty weird. We want the observations to form an even band around the regression line, and here it looks more like a curve: it goes up and then flattens out a bit, and we also have much more dispersion here than there. Now we have done the diagnostics. We did the normal Q-Q plot, then we did the residual versus fitted plot.
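The residual-on-residual construction described above can be verified directly: the slope from the added-variable regression equals the focal variable's coefficient in the full multiple regression (the Frisch-Waugh-Lovell result). A sketch with the mtcars stand-in; if the car package is available, car::avPlots(fit) draws these plots in one call:

```r
fit <- lm(mpg ~ hp + wt, data = mtcars)
# Part of the focal predictor (hp) not explained by the other predictor(s):
e_x <- resid(lm(hp ~ wt, data = mtcars))
# Part of the outcome (mpg) not explained by the other predictor(s):
e_y <- resid(lm(mpg ~ wt, data = mtcars))
plot(e_x, e_y)         # the added-variable plot for hp
abline(lm(e_y ~ e_x))  # this slope equals coef(fit)["hp"] exactly
```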
We did the influence plot, or the outlier plot, and the added variable plot, and now we have to decide what we want to do with the model. One idea we could try is to use heteroscedasticity-robust standard errors. But our sample size is small and there is no clear evidence of a serious heteroscedasticity problem, so in this case I would probably use the conventional standard errors. Another idea is to consider dropping general managers and see if the results change. Even if we decide to keep general managers in our sample, that could work as a robustness check in the paper; in the Deephouse paper they estimate the same model with and without the outlier observation and then compare the results. And we should consider a log transformation of income. Considering income in relative terms makes a lot more sense anyway, because when you think of raises, for example, or you want to switch to a new job, you typically want to negotiate a salary increase relative to your current level. Also, how much additional salary increases your quality of life depends on your current salary level. If you give a thousand euros to somebody who makes a thousand euros per month, that's a big difference; if you give a thousand euros to somebody who makes five thousand euros a month, it's a smaller difference. So income, company revenues, and similar quantities we typically want to consider in relative terms, and to do that we use the log transformation.
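Both remedies mentioned above, the log transformation and heteroscedasticity-robust standard errors, can be sketched in base R. The HC1 formula below is written out by hand for transparency; in practice the sandwich and lmtest packages compute it in one call. Variable names are again mtcars stand-ins, not the video's:

```r
fit  <- lm(mpg ~ hp + wt, data = mtcars)       # original specification
fit2 <- lm(mpg ~ hp + log(wt), data = mtcars)  # predictor in relative (log) terms
summary(fit2)$adj.r.squared                    # compare with summary(fit)$adj.r.squared

# Heteroscedasticity-robust (HC1) standard errors by hand:
X <- model.matrix(fit)
u <- resid(fit)
n <- nrow(X); k <- ncol(X)
bread <- solve(crossprod(X))  # (X'X)^-1
meat  <- crossprod(X * u)     # X' diag(u^2) X
V <- n / (n - k) * bread %*% meat %*% bread
sqrt(diag(V))                 # robust standard errors for the coefficients
```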