This video is about numerical optimization in the context of maximum likelihood estimation. What is numerical optimization, and why should an applied researcher care? To start with the first question: for many statistical techniques, we cannot simply take the data and apply algebra to obtain the estimates. Instead, the computer takes the model parameters and the likelihood function, tries different parameter values, and finds the values that maximize the likelihood. The second question is, why should an applied researcher care, because this is something that the computer does for you, so you don't have to do it manually. The problem is that sometimes the task that you give to the computer is too challenging for the computer to solve. For example, you can have a model that is not identified, or a model that is empirically underidentified, or sometimes the computational algorithm for calculating the estimates simply fails. So what do you do in that situation? For example, if you're a Stata user, your screen may fill up with this kind of output: you have the iterative estimation, and Stata just prints the same likelihood over and over and tells you that something is not concave. What exactly is not concave, and what does "not concave" even mean? Of course, you can just do trial and error: simplify your model, change the options for the estimation or the optimization algorithm, and so on, and hope that that solves the problem. Another approach is to understand what "not concave" means, why something is non-concave, and what you can do about it once you understand the problem. Or you can have an estimation that terminates with an error message about numerical derivatives. What are derivatives?
If you have done calculus in high school, you know what a derivative is, but we want estimates; we are not interested in any derivatives. So why should we care about derivatives? If you understand what the computer tries to do, then you will understand why derivatives are important in optimization, why a failure to calculate derivatives is a problem, and what we can do about that problem as well. Or you may get an error message that tells you that the Hessian is not negative semi-definite. What is a Hessian? What does it mean that it's not negative semi-definite? I have another video where I go into that in more detail; this video just gives an overview of what these concepts are and how they relate to numerical optimization. In another video I go through in more detail how we interpret the Hessian to help us find what exactly the problem is, why the model does not converge, and what we can do about it. So the first problem that you can encounter is that your statistical software does not give you estimates. It can either go on forever, like in the first example, just printing iterations until it reaches the maximum limit, which in Stata is 16,000, or it can go on for a little while and then print out an error message; either way, you don't have any estimates to interpret. That's the first problematic case. The second problematic case is that you get messages about something being not concave, messages about the Hessian, messages about derivatives, but you also get estimates. So can we trust these estimates? We see "not concave", we see "backed up", and we get estimates. Is everything fine, or should we care about these messages and understand what they mean before we go on and interpret the estimates?
In this particular case, the model is not identified, which means that the estimates, or at least some of them, cannot be meaningfully interpreted. But Stata doesn't give us any indication of that in the actual output. If you look at the iteration history, the "not concave" message gives us a signal that the model is not identified, but that's the only signal that we get in this particular case. In this other case, on the other hand, we get all kinds of error messages or notifications during optimization, such as the optimizer switching between BFGS and BHHH and that kind of thing, but the estimates are actually trustworthy. So when do these errors or notifications matter, and when can they be safely ignored? This is something that we'll talk about in this video and in another video, where I discuss the technicalities of the numerical optimization in more detail. What does numerical optimization do, and what is actually being maximized? Let's take a look at the likelihood function that I've used in another video. Our task is to estimate the population mean. We assume that the population standard deviation, or population variance, is one, and our sample consists of three observations: two, three, and four. Our task is to find the population mean that has the maximum probability, or maximum likelihood, of producing these observations. This is the likelihood function, so it gives us the likelihood: what is the likelihood of getting these three observations when the population mean is, for example, one? What is the likelihood at two, and what is the likelihood when the population mean is three? The maximum is at three, so three is our maximum likelihood estimate. In practice, for computational reasons, we don't optimize the likelihood function directly; instead, we maximize the log likelihood.
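The numbers just described can be checked with a few lines of Python. This is my own sketch, not anything shown in the video; the function name and the grid of candidate means are chosen for illustration. It evaluates the likelihood of the sample 2, 3, 4 under a normal population with standard deviation 1 at a few candidate means:

```python
import math

data = [2.0, 3.0, 4.0]

def likelihood(mu, sigma=1.0):
    """Joint density of the sample under a Normal(mu, sigma^2) population."""
    value = 1.0
    for x in data:
        value *= math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    return value

for mu in (1.0, 2.0, 3.0, 4.0):
    print(f"mu = {mu}: likelihood = {likelihood(mu):.6f}")
# The likelihood is largest at mu = 3, the sample mean.
```

Trying each candidate in turn like this is exactly the "try different values" idea; the rest of the video is about doing the trying cleverly.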
The log likelihood is simply the logarithm of the likelihood function, and there are a couple of advantages to using logarithms. One is that the log likelihood is quite commonly a concave function. A concave function means that when we start from here, we always turn right, or always curve down; the idea is that from every point of this curve we can see every other point. So it always goes in the same direction: it curves right and never curves left. The raw likelihood instead, if we start from here, curves up at first, then curves down, and then curves up again. So we first curve left, then right, then left again. That is not concave; this is a concave function. Concave functions are easier for numerical optimization than non-concave functions, and that's one of the reasons why you get the "not concave" warning: it tells you that the computer is having some difficulties at some point of the likelihood function. How exactly do we find the maximum likelihood here? It's pretty easy to see that the maximum is at 3, because we have calculated the likelihood at every pixel from minus 2 to 6. That is a couple of hundred likelihoods at different values, and we just choose the largest one. So that's one possibility: calculate the likelihood at every possible value and pick the largest one. In practice that is not doable, because the mean could be anywhere from minus infinity to plus infinity, and if we estimate more than one parameter, say the mean and the standard deviation, we have to consider all possible combinations of the parameters. That's just way too slow, and sometimes not even possible to calculate with the computers that we have now. So in numerical optimization we apply some math, and instead of trying to find the maximum directly, we try to find the place where the derivative of the likelihood or log likelihood function is zero.
If you remember your high school calculus class, the derivative gives us the direction in which the curve is going at a particular point. The derivative here is positive because the curve goes up; the derivative here is negative because the curve goes down. So the derivative gives the slope of the tangent, and a tangent is a line that touches the curve at one particular point and shows the direction in which the curve is going there. The green line here is the derivative. So these are the derivatives: the first derivative is positive on the left side of three and negative on the right side of three. It means that when we try different values of the mean, the likelihood will always increase when we move from the left side of three towards three, and when we cross three, the likelihood starts to decrease, because the derivative is negative. Then we also have the second derivative, which is the derivative of the derivative. The purple line here shows that the slope of the first derivative is always negative, and when the second derivative is always negative, the function is concave. If the second derivative becomes positive somewhere, the function is not concave; if it is always positive, we say that the function is convex. Typically, when we maximize something, we want our functions to be concave rather than non-concave or convex. Some computer implementations do this slightly differently: they may not actually maximize the likelihood but instead minimize the negative of the likelihood, in which case you will see things like "convex function" and that kind of thing. But in the examples I use here, we always maximize the log likelihood instead of minimizing the minus log likelihood. So how do we find the maximum? There are different computational techniques for doing so, and I'll explain an easy technique first. Let's assume that we just want to calculate this first derivative.
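These signs can be checked numerically. The sketch below is my own, not the video's code: it approximates the first and second derivatives of the log likelihood with finite differences and confirms what the green and purple lines show, namely that the slope is positive left of three, negative right of three, and that the curvature is negative throughout:

```python
data = [2.0, 3.0, 4.0]

def loglik(mu):
    # Log likelihood under Normal(mu, 1); additive constants are dropped,
    # which does not change the derivatives.
    return -0.5 * sum((x - mu) ** 2 for x in data)

def first_derivative(f, x, h=1e-5):
    # Central finite-difference approximation of f'(x).
    return (f(x + h) - f(x - h)) / (2 * h)

def second_derivative(f, x, h=1e-4):
    # Finite-difference approximation of f''(x).
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2

print(first_derivative(loglik, 2.0))   # positive: the likelihood still rises left of 3
print(first_derivative(loglik, 4.0))   # negative: the likelihood falls right of 3
print(second_derivative(loglik, 3.0))  # about -3, negative: the log likelihood is concave
```

Finite differences like these are also what software falls back on when analytical derivatives are not available, which is where "numerical derivatives" error messages come from.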
We try to find where it is zero, and we want to calculate the first derivative as few times as possible. Let's say that we calculate the first derivative at zero first, and we see that the derivative is positive. Then we calculate the first derivative at another point, say five, and we see that the derivative there is negative. In many problems we know that the first derivative is a continuous function, so if it's positive at one point and negative at another point, then it must be zero at at least one point between those two points. So we know that the zero is between zero and five. One logical thing is to calculate the derivative at 2.5, and we can see that the zero is now between 2.5 and 5, because the derivative at 2.5 is positive and at five it is negative. So we try something between them; let's take the middle, and we get 3.75, where the derivative is negative. The derivative must therefore be zero somewhere between 2.5 and 3.75. So we can narrow down where the zero is, and we split again: at 3.12 the derivative is negative, so the zero is between 2.5 and 3.12. We take something in between, 2.85, and see that the derivative there is positive, so the zero must be between 2.85 and 3.12. We get closer and closer to the actual value of three, where the derivative is zero. Next we get 2.97, and the derivative there is close enough to zero that we conclude that is our convergence point. So we found the zero of the derivative, or the maximum of the likelihood, at approximately 2.97. This is called the bisection method, and it is a simple method to understand.
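The search just described can be written in a few lines of Python. This is a toy implementation of my own, not what any statistical package actually runs: with the standard deviation fixed at one, the derivative of the log likelihood for this sample is simply the sum of the deviations from the mean, and we bisect the same bracket, zero to five, used above:

```python
data = [2.0, 3.0, 4.0]

def score(mu):
    # First derivative of the Normal(mu, 1) log likelihood for this sample.
    return sum(x - mu for x in data)

def bisect(f, lo, hi, tol=1e-8):
    """Find a zero of f in [lo, hi]; f(lo) and f(hi) must have opposite signs."""
    assert f(lo) * f(hi) < 0, "the bracket must contain a sign change"
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(lo) * f(mid) <= 0:
            hi = mid  # the sign change is in the left half
        else:
            lo = mid  # the sign change is in the right half
    return (lo + hi) / 2

print(bisect(score, 0.0, 5.0))  # approaches 3.0, the maximum likelihood estimate
```

Each iteration halves the bracket, which is why the method is reliable but slow: every extra decimal of precision costs several more function evaluations.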
But bisection is not commonly used, because it is rather slow. We can see here that 2.97 is correct only to one decimal: rounded to one decimal it gives 3.0, but at two-decimal precision it is already incorrect, because three is the correct value. So bisection is slow, and we don't use it. In practice we quite often use something called Newton's method, or the Newton-Raphson method. It is used, for example, by Stata, it is used by Mplus for certain problems, it is used by some R packages for certain estimation problems, and it's also one of the early algorithms and easy to understand. The idea of Newton's method is that we start from an initial guess. Let's say we want to find the zero of the function x squared minus four, and we start from an initial guess, called the starting value, of ten. We calculate the value of the function at ten and get somewhere around one hundred, and then we calculate the derivative and draw the tangent line. The tangent line tells us the direction of the curve at this particular point. Then we move along the tangent line until we hit zero, and that's our new estimate. Our new estimate is about five point something, so we calculate the value of the function at five point something, calculate the derivative, go along the tangent to zero, and our new estimate of where the zero is is about three.
We calculate the function value there, go along the tangent, and we are now very close to the zero: the current estimate is 2.16, and the correct value, if we apply algebra, is two. We apply another round of iteration, and now we are correct to three-digit precision; if we apply it again we get even more precision, but at this point we can declare that the function has converged. The idea is that we calculate the function value and the derivative at a particular point of the curve, and then we go along the tangent until we hit zero. Mathematically, we proceed as follows: x0 is our initial guess; we calculate the function value at x0 and the derivative at x0, divide the function value by the derivative, and subtract that from the starting value, which gives us the value at the first iteration: x1 = x0 - f(x0)/f'(x0). We proceed until x changes by so little that we declare the change no longer meaningful, and we declare convergence. This can be made arbitrarily precise: the estimate will equal two to any desired degree of precision, and typically six, eight, or sixteen digits is enough. We just decide what is close enough, so it doesn't matter that we don't get the exact value. We never get an exact maximum, but we get something that is very close to the maximum using these numerical techniques. Okay, so this is how we find a zero: we calculate the function value, we calculate the derivative of the function, and that gives us the direction and how far we go.
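Here is a minimal Newton-Raphson sketch for the example above, finding the zero of x² − 4 from the starting value 10. The function and variable names are my own; the stopping rule is the one just described, declaring convergence when x changes by very little:

```python
def newton(f, fprime, x0, tol=1e-10, max_iter=100):
    """Newton-Raphson: repeatedly follow the tangent line down to zero."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:  # x changed so little that we declare convergence
            return x
    raise RuntimeError("maximum number of iterations reached without convergence")

f = lambda x: x ** 2 - 4   # the function whose zero we want
fprime = lambda x: 2 * x   # its derivative

print(newton(f, fprime, 10.0))  # converges to 2.0
```

Note how few iterations this takes compared with bisection: near the solution, each Newton step roughly doubles the number of correct digits.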
When we maximize something, we are not seeking the zero of the function f(x); instead, we are looking for a zero of the derivative. So when we maximize, we replace the function value with the first derivative, and we replace the derivative of the function with the derivative of the first derivative, which is the second derivative. That is the simple, one-parameter problem. If we estimate two parameters, say the mean and the standard deviation of a normal distribution from the same data set in one run of numerical optimization, then we have more than one first derivative, because there is one derivative for each parameter, and we also have more than one second derivative. With multiple parameters, the first derivative becomes the gradient vector, and the second derivatives go into what we call the Hessian matrix. The gradient vector includes the derivatives with respect to all parameters, so it tells us how much the likelihood is going to change if we change each parameter by a very small amount, and the Hessian matrix tells us how much the gradient is going to change if we move in some direction. Let's take a look at how this works. But before that, we need to understand that Newton's technique is not a general solution; it can fail. Newton's technique works well when all the derivatives exist, and in this particular case the second derivative at that point doesn't exist, so Newton's technique fails. We go from an initial guess of about two, we go along the tangent to minus 1.69, then along the tangent to 2.32, and we get further and further from the minimum, or the zero, which is here at zero. So Newton's method works well for some functions but not for all functions; we could let it run forever, and it would just end up further and further away, alternating between positive and negative values. It is still possible to find the zero point here: for example, with the bisection method we can calculate the derivative at random points until we find at least one negative and one positive value, and then bisection will find that there is a zero point here, but Newton's method does not find it. It is therefore possible that these optimization techniques fail, and if it is purely a failure of the optimization technique, then the solution to the convergence problem is to switch to a different optimizer. For example, Stata provides half a dozen different optimization techniques for some estimators; for others there are fewer options, but you can switch to a different technique and see if that solves the problem. This is one of the things that I recommend as a first step for non-convergence problems, because changing to a new optimizer is quick to do, and if it solves the problem, then the problem is solved and you can go on and address other issues. But there are also reasons why optimization fails that are not related to the actual optimization algorithm. Before I go into those, let's take a look at how optimization works in the two-parameter case. The example that we had was estimating the mean, assuming that the variance is one, with a sample of three observations: two, three, and four. What if we estimate both the mean and the standard deviation? We now want to estimate two parameters, so our likelihood function is no longer just a function of the mean; it is a function that takes two parameters, the mean and the standard deviation. This plot shows the function values using a contour plot. We can see that the function gets very, very small when the standard deviation approaches zero, and the maximum of the likelihood is here, at a mean of three and a standard deviation of about 0.8. How does the computer find the
estimate? The computer finds the estimate by using, for example, Newton's method. We have the initial guess; let's say that we guess that the mean is two and the standard deviation is one. The computer first goes way up here, then comes back down, and after 136 iterations of Newton's algorithm it finds the solution. This takes perhaps less than a second on a modern computer, of course. This was a difficult problem for Newton's method because the gradient here is very steep, and we can try different estimation or optimization algorithms. For example, one that I like to use when the default Newton technique doesn't work is BFGS, and it gets us to the solution in 11 iterations. The idea is that we calculate the likelihood at this combination of parameters, then that combination, then the next, and we get closer and closer to the true value. We can also see that if we start from a different point, the path is going to be different. If we start from further away, with a starting value of zero for the mean instead of two (zero is a lot further from three than two is), then BFGS takes 138 iterations to get to the solution. So if the algorithm converges, the speed of convergence depends mainly on two things: which algorithm is used and whether it is well suited to the problem, and, even more critically, the starting values. Newton's method works well when the function has continuous second derivatives, so it is a smooth function, and the starting value is close to the actual value; if we use Newton's method starting from here, we get even more iterations. How do the gradient and the Hessian come into this picture? Let's take a look at what the computer is actually doing. Remember that in Newton's method we take the starting value, calculate the derivative at the starting value, and divide that derivative by the second derivative. In multi-parameter optimization we would divide the gradient by the Hessian matrix; because we can't divide by matrices, we multiply the gradient by the inverse of the Hessian, and that gives us the next point to try. So let's take a look at the actual numbers and how this works. This is the same problem, and we are using Newton's method, starting from the point where the mean is zero, the standard deviation is one, and the log likelihood is minus 17. This small red dot here shows our position; these dashed lines just indicate where the positions are. This is the likelihood function for the standard deviation calculated with the mean fixed at zero: if we fix the mean at zero, then estimating the standard deviation is simply finding the maximum of this function here. This is the likelihood function where the standard deviation is fixed at one and we want to find the maximum likelihood estimate of the mean, so this is equivalent to the one-parameter estimation problem that I showed before. And this is the actual normal distribution that we draw for the data. We can see quite clearly that the normal distribution is off: all the observations are on the right-hand side of the mean, which is zero, so perhaps we should move this normal distribution to the right to make the likelihood as large as possible. Then we have the first derivatives, the gradient vector: the derivative with respect to the mean and the derivative with respect to the standard deviation. The derivative with respect to the standard deviation is larger, which means that the likelihood curve here is a lot steeper: if we change s, the likelihood is going to increase a lot more than if we change the mean by the same amount. That tells the optimization algorithm that it should start by adjusting the standard deviation more than the mean. The standard deviation needs to go up, because the likelihood will increase if it goes up, and the mean needs to go right, which will also make the likelihood larger. And then the
computer will increase the mean a little and the standard deviation a lot, because the likelihood is steeper in that direction, and it ends up here. At that point, when the standard deviation is at 27, the likelihood doesn't really depend on the mean anymore, because the distribution is so wide that shifting it sideways a little doesn't really make a difference. The computer then starts to decrease the standard deviation, because, as you can see here, every time it decreases the standard deviation a little, the likelihood goes up and up, and the normal density that we fit to the data becomes narrower. Then we can see that the mean actually starts to matter: a small peak starts to form, and the computer finds that there is a maximum in that relatively flat-looking area as well. Following the tangent, or the direction that the gradient vector gives, it hits the maximum likelihood. So it found the maximum, and we know that this is the maximum likelihood for two reasons. First, the derivatives in the gradient are both zero. If we look at the likelihood for this particular mean value, it is flat here, so we are on top: if we go left or right, the value will decrease. The same holds for the standard deviation: the gradient is flat, and if we go left or right, the value will decrease. How do we know that it's a peak, instead of being a bottom or an S curve? We know it because the second derivatives, which are the diagonal elements of the Hessian matrix (I talk more about the Hessian in another video), are negative. When the second derivative is negative, I know that the curve curves down like that, and the maximum is where the derivative is zero. Also, the second partial derivative with respect to both the standard deviation and the mean, the cross partial derivative, is close to zero, and that means that this is a nicely behaving optimization problem, because the derivative of the mean does not depend strongly on the value of the standard deviation. If we increase or decrease the standard deviation by a little, the derivative of the mean stays about the same, and if we increase or decrease the mean by a little, the derivative of the standard deviation stays about the same. So, in an ideal case, the diagonal elements of the Hessian matrix, the second partial derivatives, should all be negative, and the off-diagonal elements should be close to zero; that is an easy problem for the computer. So that's basically what you have in numerical optimization: when you have found the maximum, the gradient vector is all zeros, all the second-order partial derivatives on the diagonal of the Hessian matrix are negative, and the off-diagonal elements are zero, or at least not very large. Then you know that the model has converged well. Computers run different checks for you, so you don't generally have to inspect these matrices yourself: when the computer declares convergence and there are no error messages, it typically means that these conditions hold. So how does this fail, and why? There are different reasons why the algorithm can be non-convergent, or can produce a converged solution that you shouldn't trust because of error messages. First of all, it's possible that the algorithm fails. Newton's technique, which is often the default, assumes that the starting value is reasonably close to the actual maximum likelihood, and it also assumes that the function is smooth. If your algorithm fails, there are two solutions to this problem: better starting values, or trying a different algorithm. Typically, trying a different algorithm is a lot quicker than thinking about the starting values, so personally, if the Newton technique fails, I go with BFGS; if that doesn't give me any estimates, then I'll start doing other
diagnostics. Then it's possible that the likelihood or the derivatives cannot be calculated, and this can be for a couple of reasons: it can be a lack of identification of the model, or it can be that the algorithm cannot be applied to a particular scenario. It may still be possible to find the maximum of the likelihood by using a different optimization technique, one that does not use, for example, the Hessian matrix or the second derivatives, or we can try better starting values, because this problem can be specific to a certain set of parameter values. Then it's possible that the computer implementation fails, which is different from an algorithmic failure: computer implementation failures relate to purely computational problems. It can be that at some point the likelihood or the derivatives are so close to zero that the computer rounds them to exactly zero, because computers have finite precision, and things can go wrong because of this rounding. If a derivative rounds to zero, you don't know whether to go left or right when you want to maximize the likelihood. A different algorithm, better starting values, or even a different software implementation can help in this case. These computer implementation failures are something that developers of statistical software work around: they tweak the algorithms to avoid numerical problems, for example by multiplying all the parameters by a constant, or by multiplying the likelihood by a constant, to make sure it doesn't round to zero. Then it's possible that you reach the maximum number of iterations. The reason we have a maximum number of iterations is that sometimes the computer is simply unable to find a maximum, so we tell the computer to try, for example, a thousand different values, and if it can't find a maximum, to declare that the model has not converged. For some problems the maximum likelihood does not exist, or it is not unique, in which case, if we didn't limit the number of iterations, the computer would just run forever and never stop, and we don't want that, so we limit the iterations. You can also increase the limit yourself. So what you can do is use a different algorithm, use better starting values, or increase the iteration limit. Then it's possible that the maximum likelihood estimates do not exist; I have a video talking about this in the context of logistic regression analysis. This can happen, for example, when some of the parameters diverge towards positive or negative infinity: a parameter can never reach infinity, but you can always make it slightly larger. In that case you can modify the model or collect more data; those are the two remedies. It's possible that the model is not identified, which means that there is no unique solution to the problem, and in that case the only thing that you can do is modify the model; collecting more data does not help. For example, if you observe just one variable and you say that the mean of that variable is a function of two parameters, A and B, you cannot solve for, or estimate, A and B at the same time, because you only have one mean, and you cannot estimate two different quantities from one quantity. The final reason why the algorithm can fail is empirical underidentification, which relates to a scenario where some of the parameter values are so close to zero that the model becomes, in effect, unidentified: it is as if a parameter didn't exist in the model. Diagnosing all of these takes a lot of practice, and you need to understand what the computer does when it tries to find the maximum, because the only output that you get is the parameter values, the gradient, and the Hessian. By interpreting these, particularly the Hessian, which I discuss in another video, you can make at least an informed guess at what the problem may be.
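To tie the pieces together, here is a self-contained sketch of Newton's method for the two-parameter problem from the video: estimating the mean and standard deviation of the sample 2, 3, 4. This is my own illustration, not any package's implementation; it uses the closed-form gradient and Hessian of the normal log likelihood, and, because plain Newton is fragile on this problem, the starting values are deliberately chosen reasonably close to the solution:

```python
data = [2.0, 3.0, 4.0]
n = len(data)

def gradient(mu, sigma):
    # First derivatives of the Normal(mu, sigma) log likelihood.
    r1 = sum(x - mu for x in data)
    r2 = sum((x - mu) ** 2 for x in data)
    return r1 / sigma ** 2, -n / sigma + r2 / sigma ** 3

def hessian(mu, sigma):
    # Second derivatives: d2/dmu2, the cross partial, and d2/dsigma2.
    r1 = sum(x - mu for x in data)
    r2 = sum((x - mu) ** 2 for x in data)
    h_mm = -n / sigma ** 2
    h_ms = -2.0 * r1 / sigma ** 3
    h_ss = n / sigma ** 2 - 3.0 * r2 / sigma ** 4
    return h_mm, h_ms, h_ss

def newton_two_parameters(mu, sigma, tol=1e-10, max_iter=200):
    """Each Newton step subtracts (inverse Hessian times gradient)."""
    for _ in range(max_iter):
        g_m, g_s = gradient(mu, sigma)
        h_mm, h_ms, h_ss = hessian(mu, sigma)
        det = h_mm * h_ss - h_ms ** 2
        # Solve the 2x2 system H * step = gradient by hand.
        step_m = (h_ss * g_m - h_ms * g_s) / det
        step_s = (h_mm * g_s - h_ms * g_m) / det
        mu, sigma = mu - step_m, sigma - step_s
        if abs(step_m) < tol and abs(step_s) < tol:
            return mu, sigma
    raise RuntimeError("did not converge")

mu_hat, sigma_hat = newton_two_parameters(2.5, 1.0)
print(mu_hat, sigma_hat)  # mu -> 3.0, sigma -> sqrt(2/3), about 0.816
```

At the solution, the gradient components are (near) zero, the diagonal Hessian entries are negative, and the cross partial is small: exactly the convergence conditions discussed above. From worse starting values this plain Newton iteration can overshoot or diverge, which is why real software adds safeguards and alternative optimizers.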