Logistic regression analysis is a commonly used tool for binary dependent variables. A binary variable is a variable that takes the values 1 and 0, and it is very commonly used for yes-or-no outcomes: whether something happens or not, whether a company decides to expand internationally or stay in its home market, whether a person is sick or not, and that kind of data. To illustrate the logistic regression technique we need some example data. This example data set contains girls from Warsaw, ranging from about 10 years to about 18 years of age. The dependent variable here is called menarche, and it records whether the girl has had her first period or not. We can see that girls at the age of 10 normally have not had their first period, while by 18 pretty much everyone has. We want to explain this relationship between age and menarche using regression analysis. There are a couple of problems when we apply normal regression analysis to this kind of data set. The first problem is that the regression line goes above one. The regression line gives the expected value of the dependent variable given age, and because the dependent variable consists of zeros and ones, the expected value is the expected probability of having had menarche. When we draw the line, we have a problem, because the predicted probability for girls who are 18 exceeds 1, and probability is bounded between 0 and 1. We also have negative probabilities here. This also causes a problem for regression analysis, because when we have small fitted values, all residuals are positive, so the error term cannot be independent of the fitted value. So with regression analysis we are violating at least the no-endogeneity assumption, and the predictions do not make any sense.
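The out-of-range predictions described above are easy to reproduce. Here is a minimal sketch that fits an ordinary least-squares line to age-group proportions and shows the fitted "probabilities" escaping the [0, 1] range; the numbers are invented for illustration, not the actual Warsaw sample.

```python
# Minimal sketch (made-up data): fit a least-squares line to a binary
# outcome's age-group proportions and show that the linear probability
# model can predict outside [0, 1].

ages = [10, 11, 12, 13, 14, 15, 16, 17, 18]
# Illustrative share of girls who have had menarche at each age:
p    = [0.0, 0.0, 0.1, 0.3, 0.7, 0.9, 1.0, 1.0, 1.0]

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(p) / n

# Ordinary least squares slope and intercept for y = a + b * x:
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, p)) \
    / sum((x - mean_x) ** 2 for x in ages)
a = mean_y - b * mean_x

for x in (10, 18):
    print(f"age {x}: predicted 'probability' = {a + b * x:.2f}")
# The line predicts a negative value at age 10 and a value above 1 at 18.
```

With these made-up proportions the fitted line predicts roughly −0.07 at age 10 and about 1.18 at age 18, exactly the two impossible cases discussed above.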
So using a linear model for this kind of data is problematic for these two reasons. Using this kind of linear model would be acceptable if most of the girls were around the middle of the range, because then the linear approximation would be okay: it would not actually predict any negative values, since we cannot go beyond the range of the data. But if we have negative predictions, or predictions that exceed 1, within the range of the data, then we have problems. This model is called the linear probability model. It can be used, but there are typically better alternatives. To start discovering better alternatives, we need to think about what the relationship looks like, and for that we can do a non-parametric analysis: for example, we take a rolling average from the data. The idea of a rolling average is this. We have here about 4,000 girls. We take the first 500, calculate the mean for those first 500, and mark it as a small dot. The average for those girls is zero, because no one has had menarche. Then we shift the window a bit to the right and take the next 500 girls, so we go from the second girl to the 501st girl, calculate the average, and mark it. Then we go from the third girl to the 502nd girl and calculate the average for that sub-sample. We continue like this; here we can see that the mean value is about 50 percent. Finally, when we have calculated the mean for all possible windows, we get this kind of non-parametric curve. It is non-parametric because we cannot express the curve as a simple function. We can see that this is an S-shaped curve. First, when girls get a little bit older, some girls start to have menarche, but not many. Once you hit about 13 to 14, the rate of having had menarche increases rapidly, until the increase starts to slow down at about 15, when pretty much everyone has had menarche except for a couple of exceptions. And then it flattens out at one.
This curve is called the logistic curve. The idea of logistic regression analysis is that instead of fitting a line, we fit this logistic curve, and the interpretation of the result stays the same: the curve gives us the expected probability of a girl having had menarche given her age. But, as we saw on the previous slide, this curve is a much better fit for the data. The relationship is not linear; rather, it follows an S-shape, and the logistic curve is one such S-shaped curve that we could use, and it is very commonly used. So from the model we get the probability of having had menarche given age. The model can be expressed mathematically, because all models are just equations, and the mathematical expression for the logistic regression model is as follows. First you have the linear regression model; applied to a binary dependent variable, that is the linear probability model. The logistic model extends the normal regression model by applying a function to the fitted value: we calculate the linear prediction from the observed data, and then we apply a function that gives us the logistic curve. The inverse of this function is called the link function; one is the logistic function and the other its inverse. Whether we call it the function or the inverse function does not matter much. The important thing for you to understand is that instead of using the predictions directly, we apply a function that transforms the predictions from a line into a curve. Okay, so how do we estimate the model? We can apply OLS estimation. If we apply OLS estimation, then we do diagnostics: we get the residuals, which we can calculate, we can plot residuals versus fitted values, which is one of the standard diagnostic plots, and we can check the normality of the residuals.
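The transformation described above can be shown concretely: the linear prediction can be any real number, but passing it through the logistic function always yields a value strictly between 0 and 1. The coefficients below are made up for illustration, not estimates from the Warsaw data.

```python
import math

def logistic(eta):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1 / (1 + math.exp(-eta))

# Hypothetical intercept and age coefficient (illustrative only):
b0, b1 = -20.0, 1.5

for age in (10, 13, 14, 18):
    eta = b0 + b1 * age      # linear prediction: unbounded, like a regression line
    p = logistic(eta)        # transformed prediction: a valid probability
    print(f"age {age}: linear predictor {eta:+.1f} -> probability {p:.3f}")
```

At age 10 the linear predictor is −5 and the probability is near zero; at age 18 the predictor is +7 and the probability is near one. The inverse of this transformation, the log-odds or logit, is the link function mentioned above.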
We have two violations of the regression assumptions. First of all, the residuals are not normally distributed, but that is not really a big deal; it is only relevant in very small samples. Then we have a heteroscedasticity problem, because the variation of the residuals at one end is a lot higher than at the other; the variance is the square of the residual. So we have a heteroscedasticity problem, and we are in violation of the MLR.5 and MLR.6 assumptions. Whether or not that is a big deal, we could use robust standard errors. But there are also some computational difficulties when we try to apply the least squares approach to this kind of problem. Because of those computational difficulties, and because OLS is not ideal anyway due to the violation of these assumptions, we estimate this model using a different approach called maximum likelihood estimation.
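Maximum likelihood estimation for the logistic model can be sketched as follows: we choose the coefficients that maximize the log-likelihood of the observed zeros and ones, here by plain gradient ascent. This is a simplification under stated assumptions: statistical packages use Newton-type iterations instead, and the data below are simulated rather than the actual Warsaw sample (age is centered to keep the simple gradient steps well behaved).

```python
import math
import random

def logistic(eta):
    return 1 / (1 + math.exp(-eta))

# Simulate 1,000 girls: centered age x = age - 14, menarche probability
# rising with age. True coefficients (on the centered scale): 1.0 and 1.5.
random.seed(0)
data = []
for _ in range(1000):
    x = random.uniform(10, 18) - 14
    y = 1 if random.random() < logistic(1.0 + 1.5 * x) else 0
    data.append((x, y))

# Maximize the log-likelihood sum(y*log(p) + (1-y)*log(1-p)) by
# gradient ascent; its gradient is sum(y - p) and sum((y - p) * x).
b0, b1 = 0.0, 0.0
lr = 0.5                                   # step size
for _ in range(2000):
    g0 = g1 = 0.0
    for x, y in data:
        p = logistic(b0 + b1 * x)
        g0 += y - p                        # d log-likelihood / d b0
        g1 += (y - p) * x                  # d log-likelihood / d b1
    b0 += lr * g0 / len(data)              # gradient ascent step
    b1 += lr * g1 / len(data)

print(f"ML estimates: intercept {b0:.2f}, slope {b1:.2f}")
```

Because the log-likelihood of the logistic model is concave, the ascent converges to the unique maximum, and with this sample size the estimates land close to the coefficients used to simulate the data.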