Normal regression analysis is a very convenient technique because it will always give you some results. Maximum likelihood estimation, on the other hand, can sometimes fail, and understanding why it fails allows you to troubleshoot your models and make informed decisions on how to get the model to work. In this video I will show you an example of logistic regression analysis. The purpose of the video is not to demonstrate the logistic regression analysis technique specifically but, more generally, what can cause a maximum likelihood estimation process to fail.

Let's take a look at this data. We have 8 observations, two independent variables x1 and x2, and a dependent variable y that receives values of 1 and 0. We'll run a logistic regression analysis explaining y using x1 and x2 and see what happens. The analysis setup is here. We'll be using two different software packages just to demonstrate software differences: this is R and this is the Stata syntax for running the same model. The results are in. What do we note first? The first thing we note is that lots of things are missing from the Stata output. We don't have significance tests, we don't have standard errors, we don't have the overall model test. Results are missing. Another thing is that the two programs give us different results. R says that the effect of x1 is 15 and Stata says that the effect of x1 is 33; Stata says the effect of x2 is 30 and R says it is 6. These are substantially different results, because we normally interpret the coefficients as odds ratios by exponentiating them, and a difference like that on an exponential scale is a huge difference. What do we do? Do we just pick one of these two sets of estimates and report that set as if there was no problem? Well, we need to understand what's going on.

We also note that the log likelihood is zero here, and the log likelihood is zero here as well, which means that the likelihood is one. That's a very unusual scenario: it means that getting exactly this kind of data from this model would have a probability of 100%, and getting any other values for the y variable would be impossible under this model. You don't get that kind of perfect model in practice, so what's going on? Then we also have this warning from Stata that the successes and failures are completely determined. R gives a bit less user-friendly warning: fitted probabilities numerically 0 or 1 occurred. The important thing about warnings is that a warning is the software telling you that there is something going on that you should pay attention to. Warnings are not minor inconveniences that you can just ignore before reporting the results. If you get one, you need to spend some time understanding what the warning is telling you, why it is occurring, and what you can do about it. You should not report any analysis with a warning unless you know what the warning means and have made an explicit decision not to care about it. Generally, we want these warnings to go away.

So what's the cause? Let's take a look at the data set a bit more closely. In this case the problem is the variable x1, so we can just take x2 out. What do we see here with x1 and y? We see that when x1 receives values greater than 4, y is always 1, and when x1 receives values less than 4, y is always 0. So the value of x1 perfectly predicts the value of y.
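To make this concrete, here is a minimal sketch of the same kind of analysis in R. The data values below are made up for illustration and follow only the pattern described above (y is 1 exactly when x1 is greater than 4), so the coefficients will not match the numbers quoted from the video.

```r
# Made-up illustration data: y is perfectly separated by x1 (y = 1 when x1 > 4).
d <- data.frame(
  x1 = c(1, 2, 3, 3.5, 5, 7, 9, 11),
  x2 = c(0, 0, 0, 0,   1, 2, 1, 0),
  y  = c(0, 0, 0, 0,   1, 1, 1, 1)
)

m <- glm(y ~ x1 + x2, family = binomial, data = d)
# R typically warns here that "fitted probabilities numerically 0 or 1 occurred"
# (and possibly that the algorithm did not converge).
summary(m)   # very large coefficients and standard errors, residual deviance near 0
```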
That perfect prediction is what will be problematic for maximum likelihood estimation. Let's take a look at how maximum likelihood estimation works. This is the R analysis. Maximum likelihood estimation always starts with some kind of initial guess. The computer is fitting an S-curve, because this is a logistic regression analysis, and the first guess is an S-curve that is not very steep but goes up; it goes up as x1 increases rather than down. The estimation then proceeds by trying different values for the coefficient of x1 so that the curve fits the data better. In this case the curve originally runs through here, so it predicts this observation to have about a 60% probability. Then we make the curve steeper and steeper, and we can see that this observation is predicted better and better. The problem is that there is no limit on how steep the curve can be. The steeper you make the curve, the better it predicts these observations, and you can make it indefinitely steep. There is no limit on how much you can increase the x1 coefficient; increasing it will always make the curve a bit steeper. We can see that it's not straight up yet, so we could still make it a few pixels steeper. The coefficient of x1 just goes to infinity if we allow the process to continue.

What happens to the log likelihood in this case? It goes towards zero, and it would reach zero only if every observation were predicted perfectly. So we don't have a maximum for this likelihood: the log likelihood can never be exactly zero, it just gets very, very close to zero, and we can always make the curve a bit steeper to push it still closer. The maximum of this log likelihood does not exist, and the consequence is that the maximum likelihood estimates for this model don't exist either. The estimates are indeterminate, because making the x1 coefficient larger and larger always fits the data a bit better. The increase in fit is marginal, but we cannot say that an x1 coefficient of 50 is the correct value, because a coefficient of 51 would fit the data slightly better. So the estimates don't exist. A small numerical sketch of this follows at the end of this part.

So what do you do? This is a scenario that is so well understood that statistical software has checks for it. This is from the Stata user manual. If we run the logistic model without the asis option that I used before to force it to run, Stata says that no, it can't run it, because x1 predicts the data perfectly and the estimates don't exist. The user manual also has a couple of pages of explanation about what causes this problem, how Stata deals with it, and what you can do about it. The catch is that not all possible failure scenarios are programmed into your statistical software. There are scenarios where maximum likelihood estimation can fail and there is no specific check; then it just fails and you get no warning indicating why. Perfect prediction is well understood, so you can rely on the software catching it. But now let's take a look at another problem where the software doesn't catch it before estimation.

This is another variant of the same analysis. We add one more observation, a ninth observation with x1 at 11 and x2 at zero, which are the same values that we had for the eighth observation, but this time the y variable receives the value of zero. Now we cannot predict perfectly, because the prediction calculated from x1 and x2 is the same for both of these observations while their y values differ.
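Here is the promised numerical sketch of the point that the log likelihood only creeps towards zero. It uses the same made-up data as the sketch above and, purely for simplicity, fixes the centre of the logistic curve at x1 = 4 and varies only the slope; that centring is my assumption, not something taken from the video.

```r
# Bernoulli log likelihood of a logistic curve centred at x1 = 4; only the
# slope b is varied, so we can watch what happens as the curve gets steeper.
x1 <- c(1, 2, 3, 3.5, 5, 7, 9, 11)   # same made-up values as above
y  <- c(0, 0, 0, 0, 1, 1, 1, 1)

loglik <- function(b) {
  p <- plogis(b * (x1 - 4))          # predicted probability of y = 1
  sum(y * log(p) + (1 - y) * log(1 - p))
}

for (b in c(1, 5, 10, 50, 100)) {
  cat("slope", b, "-> log likelihood", loglik(b), "\n")
}
# The log likelihood rises towards 0 (the likelihood towards 1) as b grows,
# but it never gets there, so no slope maximizes it and no ML estimate exists.
```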
Back to the nine-observation variant: if we predict the one perfectly, then we don't predict the zero, and vice versa, so we can't predict perfectly using this data. What happens is that the perfect prediction check does not trigger. Stata will try to estimate the model, and R will try as well. And we again get a warning: convergence not achieved. Stata tried to estimate the model, couldn't find a maximum of the likelihood, went through 1600 iterations, which is the default limit, and then just gave up. You can of course increase the limit and have Stata try 10,000 different sets of estimates; it still cannot find the maximum, because for this model the maximum doesn't exist either. So Stata tries, gives up, and what do we do about it?

We see that we don't have standard errors for one of the parameters. That is an indication of a problem we have to deal with, at the very least because we want to report the standard error, or failing that at least the p-value, and now we have nothing to report. So the missing standard errors indicate that there is some kind of problem. We can also see that the iteration log says the likelihood is not concave, and that gives us information that is useful for troubleshooting. I will not go through the troubleshooting procedure in this video, but just to demonstrate what is available to you, let me explain what "not concave" means.

When we estimate by maximum likelihood, we do trial and error. This is from the video where I demonstrated maximum likelihood estimation of a population mean using this data. We can see that when the values are 2, 3 and 4, a good estimate for the population mean is 3, and that is in fact the maximum likelihood estimate. If we try any other values in the likelihood function, we get smaller likelihoods. We have the actual likelihood here and the log likelihood here. What's important about the log likelihood is that it is a curve that bends down; we say that it is a concave curve. For a concave curve, the second derivative, which quantifies the curvature, is always negative. If the second derivative is negative and the curve is concave, then we know that there is a peak somewhere, and that peak is our maximum likelihood estimate. If this curve were flat somewhere, it would not be concave, because it would not be bending down all the time, and we would not have a unique maximum likelihood estimate: multiple different values of the parameter, the estimate of the mean, would be equally good from the maximum likelihood perspective. We could also have a curve that goes down first and then curves up, and that would not be concave either. So that is what "not concave" means: the log likelihood does not have the shape that guarantees a single, easy-to-find peak. We can check what the actual problem is by looking at the matrix of second derivatives here, which tells us how strongly the log likelihood curves down, and we can see that there are a couple of zeros there. Those zeros indicate that we have a problem with those parameters. The troubleshooting and exact interpretation of this is something I will leave for another video; a small numerical sketch of the concavity idea follows at the end of this part.

Let's get back to the problem. We have missing standard errors here, which is an indication that we definitely need to do something, and we have a warning that three failures and two successes are completely determined.
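As mentioned, here is a small numerical sketch of the concavity idea, using the 2, 3 and 4 example from the earlier video. Fixing the standard deviation at 1 is my simplification; the point is only that the log likelihood of the mean curves downwards, so its second derivative is negative and it has a single peak.

```r
# Log likelihood of the mean of a normal distribution for the data 2, 3, 4,
# with the standard deviation fixed at 1 for simplicity.
x  <- c(2, 3, 4)
ll <- function(mu) sum(dnorm(x, mean = mu, sd = 1, log = TRUE))

sapply(seq(1, 5, by = 0.5), ll)   # the largest value occurs at mu = 3

# Numerical second derivative at the peak: it is negative (about -3 here,
# i.e. -n / sigma^2), so the curve is concave and mu = 3 is the maximum.
h <- 1e-4
(ll(3 + h) - 2 * ll(3) + ll(3 - h)) / h^2
```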
The logical thing to do next is to ask which two and which three. To find out which observations are predicted perfectly, we can use the model, even though it has not converged, to calculate the actual predictions. We can see here that the predicted values for these three zeros are exactly zero, and for these two the predicted values are exactly one; that is what the warning refers to. The predicted value for this one is very close to zero, so if we allowed Stata to keep estimating forever, it would probably predict this one to be zero as well and this one to be one. So we have basically seven observations that are perfectly predicted and two that are not. A small R sketch of this check follows at the end of this part.

So what's going on here? It's not perfect prediction, because these two observations are not, and cannot be, predicted perfectly, and for that reason Stata doesn't catch the problem. A logistic regression model with more than one variable can be understood as this kind of surface. We have x1 here, x2 here, and y on this axis, and we can see how the observations depend on x1 and x2. The circles up here are the observations whose actual value is one, the circles down here are the observations whose actual value is zero, and the position of a circle indicates the values of x1 and x2 for that observation. The cross is the predicted value on the surface. When we do maximum likelihood estimation, we want to adjust the surface by adjusting the coefficients of x1 and x2 so that the predicted values are as close to the observed values as possible. What happens, again, is that we can make this surface indefinitely steep, but the predicted value for this one observation will always be in the middle of the surface, and it can never be predicted perfectly, because you can't predict one and zero at the same time. This combination of x1 = 11 and x2 = 0 corresponds to two different values of y, and that is why you can't predict perfectly. But the problem is the same: the coefficients of x1 and x2 grow large, the intercept goes towards minus infinity, and the log likelihood keeps increasing without ever reaching a maximum. You can always make the surface a bit steeper and it will fit the data a bit better, but there is no limit on how large the coefficients of x1 and x2, and how small the intercept, can be.

So what do you do with these problems? There are four options that still let you present some results. Option one: if you only used Stata and never ran the R analysis, you wouldn't even have noticed that the two programs gave radically different results. You simply choose the results you got, ignore the warning, and present the results in your paper. I can't tell you how common that is, but I'm pretty sure some people do it. Understanding the warning requires effort, and if you have some estimates you could report without going through that extra effort, some people probably will just report them. That's a bit unethical, because the software is telling you with the warning that there is a problem you should pay attention to, and you are ignoring evidence of a problem and reporting the results as if there were no problem. The second alternative is trial and error. This is something that I did a lot before I started to think that maybe I should understand what the computer is doing. You just try things: you drop cases, you drop variables, until you get the warning to disappear.
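Here is the prediction check from the beginning of this part, sketched in R rather than Stata. Apart from the duplicated x1 = 11, x2 = 0 row mentioned in the video, the values are the same made-up data as before, so the exact fitted probabilities will differ from those shown on screen.

```r
# Nine-observation variant: the last two rows share x1 = 11, x2 = 0 but have
# different y values, so no model can predict every observation perfectly.
d9 <- data.frame(
  x1 = c(1, 2, 3, 3.5, 5, 7, 9, 11, 11),
  x2 = c(0, 0, 0, 0,   1, 2, 1, 0,  0),
  y  = c(0, 0, 0, 0,   1, 1, 1, 1,  0)
)

# R will likely warn again that fitted probabilities numerically 0 or 1 occurred.
m9 <- glm(y ~ x1 + x2, family = binomial, data = d9)

p <- fitted(m9)   # predicted probabilities (the underlying ML estimates do not exist)
round(p, 4)

# Observations whose fitted probability has collapsed to (numerically) 0 or 1
# are the "completely determined" ones; the two conflicting x1 = 11 rows
# should sit near 0.5 instead.
which(p < 0.001 | p > 0.999)
```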
So you run and run, trial and error, without understanding why the error sometimes appears and sometimes doesn't, and then you pick one of the analyses that doesn't produce it. This is a bit better, because at least you are trying to do something about the problem, but blind trial and error can leave you with a suboptimal model. For example, you drop a control variable because the model doesn't converge with that control in it, and then instead of getting the model to converge with the control, you run a model that doesn't control for an explanation that you would really like to control for. So this is not ideal either.

The third alternative is a bit better. If you use, for example, logistic regression a lot, this perfect prediction issue that I demonstrated here is a well-known thing. Any decent book on logistic regression analysis will tell you at least about the first case, and probably also the second case. The Stata user manual explains both cases that I demonstrated, what Stata does in those scenarios, and why. So you can try to learn each special case and how to deal with it separately, and that works if you only ever use a small number of analyses in your life. The problem is that the special cases differ between analyses; if you do negative binomial regression analysis, for example, there are different special cases. The number of special cases that you have to learn is quite large.

The fourth option is to understand the estimation principle: what the second derivatives and the likelihood are, what they mean, and how they depend on the parameter values that the computer is currently trying, so that you can see what the problem is. This is of course more difficult, but in the long run it will make you a better researcher, because you can diagnose your model in a way that simply memorizing every special case does not allow.

So these are the four options that allow you to present some results. The first is unethical, because it is unethical to ignore warnings. Trial and error is bad. The third is good but not ideal, and in the ideal case you understand what the software is doing. There is also a fifth option, which is to ignore the model: give up, don't run the model. For example, if you are just doing a robustness check, you have your main analysis results with no warnings, and then in a robustness check, where you analyze a different and less important model, you get a warning. Should you spend a day or a week troubleshooting it, should you spend a month studying the issue first and then a week troubleshooting it, or should you just leave the analysis out? Leaving a problematic analysis out is a better alternative than ignoring the warning and reporting the results anyway. I do this all the time: when the problem is not important, I don't want to spend my time dealing with it, so it is a perfectly viable option that you should consider.