GLM models with fixed effects are challenging to estimate, and many GLM models with fixed effects cannot be estimated consistently at all. The conditional logistic regression model is one workaround to the problem that fixed effects in a logistic regression simply cannot be estimated. These kinds of models are often used, for example, in CEO selection studies and sometimes in panel data sets where the dependent variable is binary.

Let's first take a look at the problem that conditional fixed-effects logistic regression addresses, or what the "conditional" in the estimation means. With linear models, we have two commonly used panel data analysis techniques for dealing with unobserved heterogeneity. In the fixed effects model, we model the consistent differences between the groups or clusters with a fixed effect that is estimated separately for each cluster. This is known as the dummy variable model, because the easiest way to understand how these models are estimated is to think of adding a dummy variable for each group or cluster in the data. The other commonly used technique is random effects estimation, also called the error components model, because the unobserved heterogeneity is modeled in the random part, as a component of the error term, and not as a fixed parameter whose value is estimated for each cluster. Importantly, this second formulation makes the random effects assumption that the unobserved differences are uncorrelated with the observed predictor variables.

When we move to the GLM world, the fixed-effects GLM for clustered data wraps the fixed part in a link function g, and the same applies to the random-effects GLM for clustered data: we have the link function, but it now also encloses the error term, or the random effect.

How are these two models estimated? In the linear world, we typically just apply a GLS transformation and then run OLS regression; the fixed effects and random effects estimators use slightly different transformations, but the OLS step is the same. Nonlinear models are estimated very differently. The generalized linear mixed model for clustered data is estimated by numerical integration: we integrate out the random effect to calculate the likelihood of the model given the data and then maximize that likelihood, which produces the maximum likelihood estimates. This is much more challenging to estimate.

In the linear regression context, we have three main strategies for estimating the fixed-effects model, and none of them works for nonlinear models. Consider the transformation first. The transformation for the fixed effects model is basically that we calculate the group or cluster mean of each of our variables and subtract those means from the observations (cluster-mean or group-mean centering), and then apply OLS regression to the centered data. Centering eliminates the unobserved effect a_j. In a nonlinear model, centering cannot be applied, because a_j is inside the link function, so we cannot just subtract it away with a simple transformation (see the equations after this paragraph). The second estimation strategy is the dummy variable model: include dummies for all clusters or groups and estimate with OLS. That works because OLS is unbiased, but it does not work with GLMs, because maximum likelihood estimates of GLM models are generally biased.
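As a rough sketch of the models just discussed (the notation here is my own, not taken from the slides):

```latex
\begin{align}
y_{ij} &= \beta x_{ij} + a_j + e_{ij}
  && \text{linear fixed-effects model} \\
y_{ij} - \bar{y}_j &= \beta\,(x_{ij} - \bar{x}_j) + (e_{ij} - \bar{e}_j)
  && \text{within transformation: } a_j \text{ drops out} \\
\operatorname{E}[y_{ij}] &= g^{-1}\!\left(\beta x_{ij} + a_j\right)
  && \text{fixed-effects GLM: } a_j \text{ is inside the link } g
\end{align}
```

Because a_j sits inside the inverse link in the last line, subtracting cluster means from y and x no longer eliminates it, which is why the centering strategy fails.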
Maximum likelihood estimates are consistent, but they are biased in finite samples, and with the cluster sizes that we typically have, say 10 or 20 observations per firm in a panel, there are very few observations from which to estimate each a_j, and we simply cannot get these right. They will be biased, and consequently the full model will be estimated inconsistently. The third strategy that we can apply in the linear case is the correlated random effects approach, also known as the Mundlak technique or the hybrid approach, where we include cluster means of the predictor variables in addition to the original variables. That produces the within effect consistently in linear models, but it too does not work in the nonlinear case. The reason is that if we take a cluster mean over a small number of observations, there will be some estimation error, call it measurement error, in the cluster mean, and those measurement errors make the whole estimator inconsistent. This third problem is discussed, for example, in a blog post by Allison.

In the context of logistic regression analysis, this problem is addressed well in Allison's book. He explains that there is an incidental parameters problem, which relates to the fact that maximum likelihood estimation techniques have been proven to be consistent, meaning that they approach the correct values when the sample size grows to infinity. The problem is that if we have panel data, growing the sample size typically means getting more firms into our sample, and that means more parameters to estimate. So if we estimate a separate intercept for each firm or each cluster, the number of parameters also grows to infinity, and the theory behind maximum likelihood estimation no longer works. The conditional fixed-effects estimator does away with this problem.

Let's take a look at the ordinary logistic regression model. The logistic regression model models the odds of the two outcomes: the probability of the positive outcome divided by the probability of the negative outcome is the exponential of the linear predictor. The fixed effect a_j is in an inconvenient place here. We can work on the math a bit and move a_j around, but that does not really take us anywhere. What we can do is make an assumption: if we assume that there is exactly one positive outcome in each group, we can write the probability differently. The probability of a positive outcome for an observation within a group is the exponentiated prediction for that observation divided by the sum of the exponentiated predictions for all observations in the group. We can think of this as the relative probability of the current case divided by the sum of the relative probabilities of all cases. Written this way, exp(a_j) is a multiplier on both the numerator and the denominator, so it cancels out, and we have a probability that does not involve the fixed effect (see the derivation after this paragraph).

To understand what this means, let's take an example. Assume we have two hockey teams that have won three and seven of their previous games. If these teams face each other, then based on these data the probability of the first team winning is its relative probability, 3, divided by 3 + 7, the sum of what we might call the relative probabilities, and the probability of the second team winning is 7 divided by 3 + 7. So it is 30% against 70%.
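Here is a sketch of that cancellation, conditioning on exactly one positive outcome in cluster j (notation assumed, as above):

```latex
\begin{align}
\Pr\Big(y_{ij} = 1 \,\Big|\, \textstyle\sum_{k} y_{kj} = 1\Big)
  &= \frac{\exp(\beta x_{ij} + a_j)}{\sum_{k} \exp(\beta x_{kj} + a_j)}
   = \frac{\exp(a_j)\exp(\beta x_{ij})}{\exp(a_j)\sum_{k} \exp(\beta x_{kj})} \\
  &= \frac{\exp(\beta x_{ij})}{\sum_{k} \exp(\beta x_{kj})}
\end{align}
```

The factor exp(a_j) multiplies both the numerator and the denominator, so the fixed effect drops out of the conditional likelihood entirely.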
And this is the idea of conditional logistic regression analysis. It does not matter how good these teams are in absolute terms. This might be a game between highly skilled teams, perhaps the final of the national league, or it might be a lower-league game. The game-level unobserved effect a_j cancels out in both cases, and we are only comparing the relative probabilities of the teams. If you want to see how this is derived, the derivation is shown on Wikipedia, but I don't think it is very useful for an applied researcher to know. If you are interested, that is the math.

Let's take a look at how this works in Stata code; a sketch of the data-generating code appears below. I generate three clusters, so I just set the number of observations to three and generate a group variable, which is just one, two, and three, and the group number serves as the unobserved effect. I then generate a linear prediction using a normally distributed x plus the group effect, and exponentiating it gives us the relative probabilities. To calculate the probability of each case, we calculate the denominator for each group, the sum of these relative probabilities, which is the same for every observation in the group, and divide each relative probability by the denominator; the results are the probabilities. So it does not really matter whether the unobserved effect a_j is high or not; we are just comparing relative probabilities, not any absolute level of skill or innovativeness or whatever our variable is. Importantly, these probabilities always sum to one within a group, so there is always exactly one observation that gets a positive outcome within a group. The number can of course be something other than one, we can have two or three, but the idea is that the number of positive outcomes within a group is fixed.

Let's take a look at example data and discuss a bit how the results from these models are interpreted. Our example data are from a case-control study of mothers of low-birth-weight babies. We have mothers with low-birth-weight babies, and for each such mother we have a control mother who is similar to the case mother but did not have a low-birth-weight baby. So this is a matched-pair case-control study, and the observations are clustered on pair ID: one cluster always contains two observations, one with a low-birth-weight baby and one with a normal-birth-weight baby. We run a conditional logistic regression analysis and get some estimates. But before we take a look at these estimates, we need to do diagnostics for this model to make sure that it makes sense. I will not do the diagnostics in this video, because this is just an introduction, but if you are interested, the book by Hosmer and co-authors, for example, provides a chapter's worth of explanation of how to do the diagnostics, and they also have Stata code that implements the diagnostics they recommend. So it is useful to take a look at if you want to apply these techniques in your own research.

Let's move to interpretation. How would we interpret these coefficients? Fortunately, we can exponentiate the coefficients, and they still work as odds ratios, just as in an ordinary logistic regression analysis. I am personally not a big fan of odds ratios, because I think they are a bit awkward to interpret, but many researchers do find them useful, based on how frequently they are reported. Another way of looking at the coefficients is to convert them into elasticities. This is explained by Kemp and Da Silva.
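Before moving on, here is a minimal sketch of the kind of data-generating code described above; the variable names and the choice of five observations per cluster are my assumptions, not the exact code shown in the video:

```stata
* Three clusters whose cluster number doubles as the unobserved effect a_j
clear
set seed 123
set obs 3
generate group = _n                     // clusters 1, 2, 3
expand 5                                // five observations per cluster (assumed)
generate x = rnormal()                  // observed predictor
generate xb = x + group                 // linear prediction including the group effect
generate exb = exp(xb)                  // relative probability of each case
bysort group: egen denom = total(exb)   // denominator, same within each group
generate p = exb / denom                // conditional probability
bysort group: egen psum = total(p)      // check: sums to 1 in every group
list group x p psum, sepby(group)
```

Adding any constant to xb within a group multiplies both exb and denom by the same factor, so p does not change; this is the same cancellation as in the derivation above.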
Kemp and Da Silva also provide a Stata program for doing this conversion. Here we would interpret, for example, the coefficient of smoke, whose elasticity is 0.7. The interpretation is that when two mothers are otherwise comparable, the smoker has a 70% higher chance of having a low-birth-weight baby than the non-smoking mother. Importantly, this is a relative effect, so it is not 70 percentage points: if the non-smoker had a 10% chance of having a low-birth-weight baby, then the smoker would have a 17% chance. So it is a relative effect instead of an absolute percentage-point effect, and a useful way to interpret the model.

We can also apply pseudo-R-squareds; in particular, the Tjur pseudo-R-squared, my favorite, is directly applicable. The Tjur R-squared is calculated by simply computing predictions: we calculate the predicted probabilities, which are always calculated here assuming that there is one positive case and one negative case, so the probabilities always sum to 1, and then we compare the predicted probabilities for the cases with the predicted probabilities for the controls. We can see that the controls get a 30% predicted probability and the cases a 70% predicted probability, so this model is pretty good at discriminating the cases from the controls. The Tjur R-squared for this model would then be 0.4.

Another thing that I often like to do after a logistic regression analysis is to calculate adjusted predictions. In an ordinary regression analysis context, the adjusted predictions are calculated by first setting everyone's weight to 80 pounds, then 100 pounds, 120, 140, 160, 180, 200, 220, and 240, so we calculate these predictions assuming everyone weighs 80 pounds, everyone weighs 120, everyone weighs 200, and so on, and then we plot the adjusted predictions. This does not work in conditional logit. The reason is that if we adjust everybody, the resulting predictions are fairly unpredictable, because we are comparing relative probabilities: if we adjust the weight of everybody, the predicted probability on average will stay the same, because the probabilities always sum to 1 within clusters. So margins is generally not very useful here, and the Stata documentation for margins simply says that these effects are not allowed, the reason being that the marginal effects depend on the other observations in the cluster. margins does calculate linear predictions, but that is not very useful, and it also calculates the probability of a positive outcome assuming that the fixed effect is 0, which is pretty useless too. So margins is pretty much useless after a conditional logistic regression model.

What we can do instead is use predict to calculate our own what-if scenarios, and I compare the base scenario against two what-if scenarios; a sketch of these calculations appears below. I have p as the base prediction, that is, the prediction given by the model with the predictor variables as we observe them. Then I adjust all mothers of low-birth-weight babies to be non-smokers, and after that I adjust everybody to be a non-smoker, and I calculate new probabilities for each case under these two conditions. When we compare with the base probabilities, the predicted probability for a mother of a low-birth-weight baby is 70%; if those mothers were non-smokers, this would go down to less than 60%. So it is a pretty substantial difference, more than 10 percentage points.
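A sketch of these what-if calculations, assuming the matched-pair low-birth-weight data distributed with the Stata manuals and its variable names (low, smoke, lwt, ptd, ht, ui, pairid); the scenario variable names are mine:

```stata
* Conditional logit on matched pairs, then home-made what-if scenarios
webuse lowbirth2, clear
clogit low lwt smoke ptd ht ui, group(pairid)
predict p, pc1                  // base: P(positive | one positive per pair)
generate smoke0 = smoke         // keep the observed smoking status
replace smoke = 0 if low == 1   // scenario 1: case mothers are non-smokers
predict p_cases, pc1
replace smoke = 0               // scenario 2: everybody is a non-smoker
predict p_all, pc1
replace smoke = smoke0          // restore the data
summarize p p_cases p_all if low == 1
```

Note that the pc1 probabilities must still sum to one within each pair, which is exactly why the second scenario behaves the way described next.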
Interestingly, when we adjust everyone to be a non-smoker, so that we basically eliminate the effects of smoking, the probability for these mothers of low-birth-weight babies goes back up, even though they are now non-smokers. This is one of the features, or let's say weaknesses, of conditional logistic regression analysis: it does not tell us how an individual's outcome would change in isolation, which is what we would want in the low-birth-weight scenario; it always compares one observation against the others. If we wanted to introduce some kind of policy that reduces the number of low-birth-weight babies in the population, this would be a pretty useless analysis.

But this kind of scenario and what-if analysis can be useful in other contexts. For example, going back to my ice hockey example from the previous slide: if a team has a star player and that star player is out of the game because of an injury, then that automatically increases the probability that the other team wins. The probabilities are tied together, because only one team can win a game. The same applies in a management research context, for example the choice of a CEO. A company must have a CEO and can have only one, so if we have five candidates and the probability of hiring one increases, the probabilities of hiring the others decrease. Asking questions such as how much less likely the leading or selected candidate would have been to be chosen had their experience been five years less is meaningful, because here it makes sense that the probabilities of the other cases increase when the probability of one case decreases. So whether these predictions make sense depends on the context of the study.

Conditional estimation can be applied to other GLMs as well. For example, Allison discusses how it applies to the Poisson regression model; there the conditioning assumption is that the sum of the counts of the dependent variable is fixed within a group.

Now, this conditional fixed-effects estimation is a bit complicated to interpret, and we might ask whether we really need it. This question, do we really need conditional logistic regression analysis, can be separated into two sub-questions. The first question is: are fixed effects required? Let's assume a hypothetical case where we have ten companies (observations are set to ten in my Stata code) that are choosing CEOs. Each company has two CEO candidates, and it will always pick one. Experience varies, so some companies get more experienced candidates than others, and the experience of the candidates also varies within a company. The choice of CEO is completely determined by experience: the more experienced candidate is always chosen, so the within R-squared of this model is exactly one. Looking at the data, you can see that some companies receive less experienced applicants and some receive more experienced ones, but a company always chooses the most experienced candidate among those who applied to it. If we ignore the clustering, that is, run an ordinary logistic regression analysis, we find that experience is only weakly associated with selection: the coefficient of experience is just 0.5, not very large considering that experience varies between a bit less than 0 and a bit more than 10, and the pseudo-R-squared is 0.004, which is a fairly small number. But if we do a conditional logistic regression analysis, we can see that the model explains the dependent variable perfectly; a sketch of this simulation appears below.
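Here is a minimal sketch of that simulation; the seed, effect sizes, and variable names are my own assumptions about the details:

```stata
* Ten firms, two CEO candidates each; the more experienced candidate always wins
clear
set seed 42
set obs 10
generate firm = _n
generate a = 2 * rnormal()                   // firm-level unobserved effect
expand 2                                     // two candidates per firm
generate experience = 5 + a + rnormal()      // applicant quality depends on the firm effect
bysort firm (experience): generate hired = (_n == 2)   // most experienced is chosen
logit hired experience                  // ignores clustering: weak association
clogit hired experience, group(firm)    // perfect within-firm prediction, coefficient diverges
```

Because the within-firm outcome is fully determined by experience, clogit runs into the perfect prediction problem discussed next.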
Because of the perfect prediction, the effect of experience is actually infinite, and we do not get a standard error. This is the perfect prediction scenario that I talk about in the context of logistic regression in another video. Let's take a look at the data again. The dummy coefficients here are estimates of the fixed effects; again, they cannot be consistently estimated, but we can still see something from them. The fixed effect gets more negative for these companies as we go down the list, and the experience gets more positive as we go down the list, so there is a strong correlation between the fixed effects, which are unobserved, and the observed experience. The random effects assumption therefore fails in this case. If we simply analyzed these data with a logistic regression analysis, or even a random effects logistic regression analysis, we would reach a completely incorrect conclusion. Experience is the only thing that matters, yet those other models would show that experience has little effect, because they cannot take into account the unobserved fixed effects at the cluster level.

The second question is: do we need a logistic model? This is more nuanced, and if you are wondering about using a conditional logistic regression model, there are quite a lot of articles that discuss the trade-offs between a logistic model and a linear model. If we look at the logistic curve between predicted probabilities of about 0.2 and 0.8, it is nearly straight, so it is effectively linear. If our predicted probabilities are in that area, then using an ordinary linear regression analysis will be completely fine and a lot simpler than using conditional logistic regression analysis. This figure is from Timoneda, who provides other similar figures and simulation evidence on the relative performance of these two estimation approaches, the conditional logistic model and the linear model. This is also summarized nicely in a table by Gomila, who notes that linear regression models are generally easier to interpret and that logistic regression analysis rarely has any substantial advantages over a linear model. Of course, there are scenarios where the non-linearity of logistic regression analysis is an advantage, but in many cases the linear model should at least be considered because of its simplicity; a minimal sketch of the linear alternative follows.
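For comparison, one simple way to fit the linear alternative on the simulated CEO data above; the choice of areg for the group fixed effects is mine:

```stata
* Linear probability model with firm fixed effects and cluster-robust SEs
areg hired experience, absorb(firm) vce(cluster firm)
```

The group fixed effects are absorbed rather than estimated one by one, so the incidental parameters problem does not bite here: in a linear model the within estimator is consistent regardless of cluster size.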