After watching the GLM videos, you must have the question: should you use these models, and if so, when and why? That's the question that I will answer in this video.

Question number one, out of two questions about whether GLM is required or useful, is: is a transformation required? What do you think is the nature of the relationship between your independent variables and the dependent variable? Is it linear and additive, so that all the independent variables work separately and the effect on the dependent variable is their sum? Or is it exponential and multiplicative, which means that you multiply the effects of the independent variables together to get the effect on the dependent variable? Or is it perhaps an S curve, where the effect is first very small, then increases, and then is very small again, because everybody is at 100% already? This is a question about theory and what kind of relationships you expect. It's not a question of how the dependent variable is distributed. So this is primarily a modeling decision, not a data decision.

My practical recommendation is that you should always start with linear regression analysis and then do diagnostics: do an added variable plot, do a residual versus fitted plot, and see if there is evidence of nonlinearity. If there is, then you consider these alternatives. Of course, you may have a strong theoretical reason to believe that an exponential model or an S curve model is preferable, but doing the regression analysis is still very cheap. It doesn't cost you much time, and most of the time it'll tell you something that you didn't know before. So starting with regression analysis is a good idea.

A third consideration is that some textbooks and some articles say that you should transform the dependent variable to reduce heteroscedasticity so that your standard errors will be correct. But the decision of which transformation you apply has nothing to do with standard errors whatsoever.
It's driven by theory: what do you think is the best explanation for the data? The consideration for standard errors is secondary to that, and you can always use robust standard errors to deal with any heteroscedasticity anyway. So this is driven by theory, not by standard error consistency or by the kind of data that you have.

The next question comes once you have decided that you want to transform your dependent variable somehow. You can also transform independent variables, but this is mostly focused on the dependent variable. Should you transform the dependent variable and then apply regression on the transformed values, or should you apply a generalized linear model, where you transform the fitted value instead of the dependent variable?

There are simple points for and against both decisions. The simple points for transforming the dependent variable are these. It's simple to do, so there are no computational issues: regression analysis will always give you results. You can also use OLS diagnostics. Regression analysis diagnostics are very useful, they are more developed than diagnostics for GLM, and you can find more resources on how to do them. Also, regression analysis is well understood. For example, the nature of multiplicative effects, as I explained in previous videos, is something that many researchers don't fully understand. So regression analysis is more commonly understood by readers and reviewers than GLM.

There are also simple points against transforming. Transforming a variable with a few discrete values is problematic. If you have a count variable with values 1, 2, and 3, then trying to do some kind of inverse Poisson transformation on that wouldn't make much sense, because it still has three discrete values. If you have ones and zeros, a binary dependent variable, transforming it will give you another binary variable, so it doesn't do anything.
And then you have the issue that if you want to, for example, explain company size, and you want to explain that with an exponential function, some companies have zero revenues. So how do you deal with those zeros? You can't take a log of zero, so you need these awkward workarounds where you add 1 to the dependent variable before you take the log. So these are the simple points against transforming.

There's a more rigorous way of looking at this issue. Let's look at this GLM model and the transformed model. Typically, we are interested in explaining the mean of the data, or the expected value of the data given the independent variables, in which case we look at the nonlinear regression model. If we apply this transformed dependent variable model and we treat these coefficients here as if they were estimates for the original model of interest, then they are actually inconsistent. So the transformed equation is an inconsistent estimator for the original equation. Statistically thinking, you should never transform the dependent variable. You should always use the GLM, because the transformed variable regression is an inconsistent estimator of the GLM.

That may not be enough to convince all people, so let's take a look at examples. I have this data set here. This is the Prestige data set that I've used before. We have the distribution of income for professions that are more than half men and the distribution of income for professions that are more than half women. So we have men-dominated and women-dominated professions, and we are interested in knowing whether men-dominated professions make more money than women-dominated professions. This is something that we would typically want to answer with "men make 20% more" or "50% more" instead of saying that men make 4,000 Canadian dollars more, because percentages are what we typically think in for these kinds of comparisons. So how do we do it?
When we look at percentages and we do a transformed dependent variable regression analysis, we get some estimates here. Then we can calculate predictions using these estimates. The predicted lines are in the plot here. We can see that the predicted lines are lower than the actual sample means. So the model predicts the sample means a bit incorrectly: it predicts them too low. The model also predicts the difference between men-dominated and women-dominated professions to be smaller than what it actually is. So both the actual means and the difference between the means are predicted incorrectly. The difference is not great, but it's noticeable.

Based on these considerations, the GLM approach should always be preferred over transforming the dependent variable. Of course, transforming the dependent variable, using OLS, and doing the diagnostics is a good starting point. But in the end, the GLM is more rigorous, and that's what the end product of your research should be. There's a nice blog post about this from William Gould, who is the founder of Stata. He makes a strong case, with some nice references, that this is actually how you should do it. So don't log transform the dependent variable. Use the Poisson GLM or QML estimate instead, with robust standard errors. That gives you better estimates than the regression on the transformed dependent variable.

So what are the practical recommendations? Once you have decided that you want to use one of these transformations, what's the modeling technique that you should apply? For a linear additive model, use least squares always. There's no reason to use anything else. OLS is the best. Weighted least squares could be slightly more efficient in some scenarios, but it's not worth the effort.
If you have an exponential model with multiplicative relationships, and you know the distribution of the dependent variable given the fitted values, then use maximum likelihood estimation of the generalized linear model with the correct distribution. So if you know that it's Poisson, or negative binomial, or something else, then apply the normal GLM. If you don't know what the distribution is, or you're uncertain about it, or you know that it doesn't follow any of the distributions that your statistical software supports, then apply Poisson quasi-maximum likelihood estimation with robust standard errors. It's a similarly safe choice as OLS is for the linear model.

For S-curve models it's the same thing. If you know the distribution, for example you are using fractional response data and you know that the dependent variable is beta distributed given the predicted values, then use beta regression analysis, that is, maximum likelihood GLM with the correct distribution. Otherwise, if you don't know the distribution of the dependent variable, use Bernoulli quasi-maximum likelihood with robust standard errors. So if you have fractional response data, I would basically always recommend that you use just the normal logistic regression analysis for that, because it works. You would think that it doesn't, but it actually does, as long as this approach has been programmed into your statistical software.

Now, this has nothing to do with the transformation of the independent variables. This is about the dependent variable. Transforming independent variables is okay, and you can consider the log transformation, or sometimes even an exponential transformation, of the independent variables to get the model that you think explains your data well based on your theory, and then you estimate it with either OLS or GLM.
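The Poisson quasi-maximum likelihood recommendation for the exponential model can be sketched as follows. This is a minimal illustration on simulated data, not the Prestige data: the intercept 1.0, the slope 0.12, the noise that grows with x, and the function names are all assumptions made for the example. In practice you would use your statistical software's GLM routine with robust standard errors rather than this hand-written Newton iteration.

```python
import math
import random


def poisson_qmle(x, y, iterations=25):
    """Fit E[y | x] = exp(b0 + b1 * x) by Newton's method.

    Returns (b0, b1, robust_se_b1). The standard error uses the sandwich
    formula A^-1 B A^-1, so it does not assume that y is Poisson distributed.
    """
    def moments(b0, b1):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        a00 = sum(mu)
        a01 = sum(m * xi for xi, m in zip(x, mu))
        a11 = sum(m * xi * xi for xi, m in zip(x, mu))
        return mu, a00, a01, a11

    b0, b1 = math.log(sum(y) / len(y)), 0.0  # start from the intercept-only fit
    for _ in range(iterations):
        mu, a00, a01, a11 = moments(b0, b1)
        # Score vector: sum of (y - mu) * (1, x)
        g0 = sum(yi - m for yi, m in zip(y, mu))
        g1 = sum((yi - m) * xi for xi, yi, m in zip(x, y, mu))
        det = a00 * a11 - a01 * a01
        # Newton step: add A^-1 times the score
        b0 += (a11 * g0 - a01 * g1) / det
        b1 += (a00 * g1 - a01 * g0) / det

    # Sandwich variance for the slope, evaluated at the final estimates
    mu, a00, a01, a11 = moments(b0, b1)
    det = a00 * a11 - a01 * a01
    r0, r1 = -a01 / det, a00 / det  # slope row of A^-1
    var_b1 = sum((yi - m) ** 2 * (r0 + r1 * xi) ** 2
                 for xi, yi, m in zip(x, y, mu))
    return b0, b1, math.sqrt(var_b1)


def log_ols_slope(x, y):
    """Slope from an OLS regression of log(y) on x."""
    logs = [math.log(yi) for yi in y]
    mean_x = sum(x) / len(x)
    mean_log = sum(logs) / len(logs)
    sxy = sum((xi - mean_x) * (li - mean_log) for xi, li in zip(x, logs))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


# Simulated data (assumption): E[y | x] = exp(1.0 + 0.12 * x), with
# multiplicative noise whose spread increases with x.
rng = random.Random(0)
x, y = [], []
for _ in range(5000):
    xi = rng.uniform(6, 16)
    sigma = 0.04 * xi
    noise = rng.gauss(0, 1) * sigma - sigma ** 2 / 2  # E[exp(noise)] = 1
    x.append(xi)
    y.append(math.exp(1.0 + 0.12 * xi + noise))

qmle_intercept, qmle_slope, robust_se = poisson_qmle(x, y)
ols_slope = log_ols_slope(x, y)
print("Poisson QMLE slope:", qmle_slope, "robust SE:", robust_se)
print("log(y) OLS slope:  ", ols_slope)
```

Because the multiplicative noise in this simulation gets stronger as x grows, the slope from the regression on log(y) should come out below the Poisson quasi-maximum likelihood slope, which targets the mean of y directly.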
These recommendations are about what you do with the dependent variable.

The final question is: is this choice between GLM, which transforms the fitted value, and transforming the dependent variable a big thing? Let's do an empirical example. We have here two models, and we are using the Prestige data. We have years of education here, and we have the predictions from these two models, the transformed dependent variable model and the GLM, for the effects on income. When we look at the regression coefficients, we can see that there is a 7.5% difference. This is 0.119, this is 0.128. A 7.5% difference is substantial. In many methodological papers, we think that a 5% bias is something that you can ignore, but this is a 7.5% difference, something that we should care about. Also, when we look at the predictions here, we can see that the transformed dependent variable model systematically underpredicts how much professions that require high education actually make, and this blue line here is a lot better fit to the data. So empirically, it's not a huge difference, but it's something that I think we should be concerned about, because the fix is rather simple.

Now the final question: if and when I get papers to review where authors use a transformation on the dependent variable, should I recommend that those papers are rejected because they don't use the GLM approach, or quasi-maximum likelihood estimation of Poisson, instead of the transformation of the dependent variable? No. I would not say that this red line is worthless. I'm saying that the blue line is better, and I would probably recommend that the authors take a look at some of the articles that are cited here that explain why the blue line is better than the red line, and then tell them to make an informed decision.
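The earlier comparison of men-dominated and women-dominated professions can be made concrete with a small sketch. The incomes below are made-up numbers, not the Prestige data, and the function names are hypothetical. The sketch relies on two facts that hold when the only predictor is a group indicator: OLS on log(y) fits the mean of log(y) in each group, so its back-transformed prediction is the geometric mean, while a Poisson GLM with a log link fits each group's sample mean exactly.

```python
import math

# Made-up incomes for two groups of professions (assumption for illustration).
incomes = {
    "men-dominated": [4000, 6500, 8000, 12000, 25000],
    "women-dominated": [3000, 4500, 5000, 7000, 11000],
}


def sample_mean(values):
    """Arithmetic mean; with only a group dummy, the Poisson GLM fit equals this."""
    return sum(values) / len(values)


def log_ols_prediction(values):
    """Back-transformed fit of OLS on log(y) with only a group dummy.

    The fitted value on the log scale is the mean of log(y), so exponentiating
    it gives the geometric mean of the group.
    """
    return math.exp(sum(math.log(v) for v in values) / len(values))


glm_fit = {group: sample_mean(v) for group, v in incomes.items()}
log_fit = {group: log_ols_prediction(v) for group, v in incomes.items()}

# Ratio of men-dominated to women-dominated income under each approach
glm_ratio = glm_fit["men-dominated"] / glm_fit["women-dominated"]
log_ratio = log_fit["men-dominated"] / log_fit["women-dominated"]

for group in incomes:
    print(group, "sample mean:", glm_fit[group],
          "log-OLS prediction:", log_fit[group])
print("ratio from sample means:", glm_ratio)
print("ratio from log-OLS:     ", log_ratio)
```

The geometric mean can never exceed the arithmetic mean, which is why the back-transformed predictions sit below the sample means. With these particular made-up numbers the ratio between the groups is also smaller under the log model, which mirrors the pattern described in the video, although that second result depends on the data.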