In this video I'll talk a bit more about the role of the Hessian matrix in maximum likelihood estimation, particularly when we find the maximum of the likelihood function using Newton's technique. I'll also talk about how to interpret the Hessian matrix and what you should check in it when you suspect a convergence problem in your model, or when your model doesn't converge at all. The first thing we need to address, again, is why an applied researcher should care. The reason is that sometimes you get messages from your statistical software. These are Stata screenshots about the Hessian: you may get an error message telling you that the Hessian is not negative semi-definite. Understanding what that means will help you fix the problem; otherwise it's just trial and error until you find something that works, without really understanding what the problem was in the first place. It's also possible that you get a solution but didn't actually solve the original problem; you just happened to make the symptom disappear without solving it. And it's possible that the software reports these estimation challenges but in the end nothing problematic is going on. To understand in which scenarios you should be worried and in which you can just look at the estimates and not worry, you need to understand a bit about what the Hessian matrix is and what it tells us.

When we maximize the likelihood, we typically do it with numerical optimization: we try different values of x in this likelihood function here until we find that the value of the function is maximized when we set x to 3. In practice the computer doesn't maximize the function directly; instead it looks for the point where the first derivative is zero. If we have a multi-parameter estimation problem, for example estimating the mean and the standard deviation from the same data, then we have two derivatives collected in the gradient vector, and we look for the point where all elements of that vector are zero. We do that with the help of the second derivative, shown on the purple line here. When we have multiple parameters to estimate, the second derivatives go into the Hessian matrix: if there are two parameters, the Hessian is a two-by-two matrix. Quite often we apply Newton's technique, where we use the first derivative and the second derivative, or the gradient vector and the Hessian matrix in multi-parameter problems. How exactly this process works I explain in another video; here we focus on the Hessian matrix and what it means for the likelihood function to be convex or concave, or for the Hessian to be positive or negative semi-definite, and so on.

Let's take a look at what convex and concave functions are and what it means for a function to be strictly convex or strictly concave. A convex function is one where the second derivative is always positive or zero. When the second derivative is zero, the function is a straight line at that point. So here we have a straight part, and then the function curves up: as we travel along the line, we either go straight or turn slightly to the left. It is convex, but not strictly convex, because of the straight part.
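To make these definitions concrete before we continue, here is a minimal sketch of my own (not from the video; the example functions are illustrations) that classifies a one-dimensional function by the sign of a numerical second derivative over a grid:

```python
import numpy as np

def second_derivative(f, x, h=1e-5):
    # Central finite-difference approximation of f''(x).
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

def classify(f, grid, tol=1e-6):
    d2 = np.array([second_derivative(f, x) for x in grid])
    if np.all(d2 > tol):
        return "strictly convex"      # always turns left
    if np.all(d2 >= -tol):
        return "convex"               # turns left or goes straight
    if np.all(d2 < -tol):
        return "strictly concave"     # always turns right
    if np.all(d2 <= tol):
        return "concave"              # turns right or goes straight
    return "neither convex nor concave"

grid = np.linspace(-2, 2, 201)
print(classify(lambda x: x**2, grid))    # strictly convex
print(classify(lambda x: -x**2, grid))   # strictly concave
print(classify(lambda x: x**3, grid))    # neither: turns right, then left
```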
This one is strictly convex, which means that as we travel along the line we always turn slightly to the left: we first turn left going down, reach the bottom, and keep turning left going up. Concave is the same except mirrored: we can have a straight part, and otherwise the function curves down; strictly concave means we always turn right. If we turn right and then left, or left and then right, the function is neither convex nor concave, and the last example shows a function like that.

So do we need to care if a function is not concave? Let's take a look at this function. This is a one-parameter estimation problem: we try to find the maximum likelihood estimate, and the function is neither concave nor convex. It is not concave in this region here, but it is concave here at the maximum. Is it a problem that, going from left to right, the function first curves left (up) and then right (down)? It is not a problem as long as our computational algorithm can get past this region, because this point here is the maximum and the function is concave there. At the maximum the tangent is flat, the derivative is zero, and because the second derivative is negative the function curves down whether we go left or right. What matters is that at the maximum the function curves down whenever we move sideways. By contrast, here we could have a point where the tangent is flat but the second derivative is non-negative, which means we are not at a maximum. So whenever we reach convergence and declare that we have found a maximum likelihood estimate, it's important that at that point the gradient vector is zero and the function is concave, because that guarantees that whichever direction we go, the likelihood will decrease.

Returning to the previous example with the Stata screenshots: the screenshot on the left showed problems in the last iteration, while the screenshot on the right showed an example of this kind of function. You have a region that is not concave, but then you iterate, climb up, and at the maximum the function becomes concave, so the fact that it's not concave somewhere else does not mean that this would not be a valid maximum likelihood estimate. Generally, the last step, or the steps close to the last step, of the optimization algorithm matter more than anything that happens far from the assumed maximum of the function.
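As a concrete illustration of how the first and second derivative drive the search, here is a sketch of my own, using a made-up strictly concave "log-likelihood" whose maximum is at x = 3, echoing the earlier example:

```python
import numpy as np

# Illustrative strictly concave function with its maximum at x = 3.
f   = lambda x: -(x - 3)**2 - (x - 3)**4
fp  = lambda x: -2*(x - 3) - 4*(x - 3)**3     # first derivative
fpp = lambda x: -2 - 12*(x - 3)**2            # second derivative, always negative

x = 0.0                                       # starting value
for i in range(8):
    x = x - fp(x) / fpp(x)                    # Newton step: follow f' toward its zero
    print(f"iter {i}: x = {x:.6f}, f'(x) = {fp(x):+.2e}, f''(x) = {fpp(x):+.2f}")
# At convergence f'(x) is zero and f''(x) < 0, so x is a (local) maximum.
```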
If we look at how the algorithm actually works for a two-parameter estimation problem, we start from some starting values, and the computer adjusts the mean and the standard deviation to find the maximum of the likelihood. I explain this process in more detail in another video, but here we can see that the gradient gets close to zero, and goes exactly to zero when we are at the maximum. In the Hessian matrix, the second-order partial derivative with respect to m and the second-order partial derivative with respect to s go to negative values, and the cross partial derivative with respect to m and s goes to zero. That indicates that whichever direction we go, whether we increase or decrease s or m, the derivative will become negative, which means we go down. The second-order partial derivative tells us in which direction the first-order partial derivative of a variable will change when we move in a given direction.

Now let's take a closer look at what the Hessian matrix is and how we use it in Newton's method of optimization. If you have studied some matrix algebra, it will be useful here, but I'll try to walk you through this even if you haven't. Newton's method in matrix form looks like this: the new values are the current values minus the inverse of the Hessian multiplied by the gradient. Here x is a vector of whatever the current values are, initially the starting values for m and s in this case, H is the Hessian matrix, and inverting the Hessian is somewhat analogous to dividing by the matrix; and this here is the gradient. We can write out the contents of the gradient and the Hessian, and I'll use some colors to make clear what the different elements are. In the Hessian we have first the diagonal: the second-order partial derivative with respect to m and the second-order partial derivative with respect to s. On the off-diagonal we have the cross partial derivative with respect to s and m; this is a symmetric square matrix, so the two off-diagonal elements are the same. In the gradient we have the first-order partial derivatives of f with respect to m and with respect to s.

What is the meaning of these terms? The partial derivative of f with respect to m tells us how much the value of f, which in this case is the likelihood function, will change if we increase m from the current point by a very small amount. If this derivative is positive, it means that in most cases we can increase the likelihood by increasing m a little; there are some exceptions related to the second-order partial derivatives, but generally, if m and s are independent, a positive derivative means that increasing m a little increases f, so we know we should increase m a little to increase the likelihood, because we want to maximize it. The second-order partial derivatives tell us how much the first-order partial derivative with respect to m is going to change when we change m by a little. So if the derivative is originally zero, meaning the function is flat, and the second-order partial derivative is negative, the derivative starts to go down immediately as we increase m. It tells how much the surface is curving down or curving up, and generally at convergence these second-order partial derivatives, the diagonal elements, should all be negative.
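To see these quantities concretely, here is a sketch of my own that evaluates the gradient and Hessian with respect to m and s at the maximum likelihood estimates, using the standard closed-form derivatives of the normal log-likelihood (which the video does not spell out):

```python
import numpy as np

def normal_loglik_derivatives(x, m, s):
    # Log-likelihood of N(m, s^2): l = -n*log(s) - sum((x-m)^2) / (2*s^2) + const.
    n, r = len(x), x - m
    grad = np.array([r.sum() / s**2,                    # dl/dm
                     -n / s + (r**2).sum() / s**3])     # dl/ds
    hess = np.array([[-n / s**2,            -2 * r.sum() / s**3],
                     [-2 * r.sum() / s**3,  n / s**2 - 3 * (r**2).sum() / s**4]])
    return grad, hess

rng = np.random.default_rng(1)
x = rng.normal(5.0, 2.0, size=200)
m_hat, s_hat = x.mean(), x.std()          # MLEs for a normal sample
grad, hess = normal_loglik_derivatives(x, m_hat, s_hat)
print(grad)   # both elements are (numerically) zero at the maximum
print(hess)   # diagonal negative, off-diagonal zero: the surface curves down
```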
Then the off-diagonal elements tell us how much the partial derivative of f with respect to m changes if we change s by a small amount. In other words, they tell how the curvature of f with respect to m is related to changes in s, and I'll show graphically what this means. When we apply this method we need to calculate the inverse, and the inverse of a 2-by-2 matrix is calculated like this: we swap the two diagonal elements, take the negatives of the off-diagonal elements, and multiply everything by one over the determinant of the matrix. That's just math. Then we can carry out this matrix product: this is a 2-by-2 matrix and this is a 2-by-1 matrix, and multiplying them together gives a matrix that is 2-by-1. The first element is the second-order partial derivative with respect to s multiplied by the derivative of f with respect to m, minus the cross partial derivative with respect to m and s multiplied by the derivative of f with respect to s; the second element is similar, except with s and m the other way around.

What do these two parts of the equation tell us? The first part is simply a distance, because we multiply how much m changes and how much s changes by the same constant: it is a scalar, the same value for both elements, so it only scales the step. The second part, the 2-by-1 matrix with two rows and one column, tells us about the direction (and also contributes to the distance). Let's take a look at this matrix in more detail and at how the direction is determined. Whenever we have a plane with s on the y-axis and m on the x-axis and we want to find the maximum, we need to decide in which direction we go on that plane and how far, and we take such steps until we find the maximum. Generally the length of the step decreases as the estimation algorithm gets closer and closer to the maximum, or convergence. So direction and distance are both important.

Let's look at the direction first, because it helps us understand the role of the second derivatives. Normally the first derivative tells you the direction, at least if the second derivatives are close to zero: if the first derivative is positive, it tells us to increase this parameter to increase the likelihood; if it's negative, it tells us the parameter should decrease to increase the likelihood. So we follow the first derivatives in the gradient vector to determine the direction. How, then, do the second derivatives influence which direction we choose on the plane of s and m? In more general cases we may have a space of 4 or 20 or 50 dimensions, and we need to choose a direction from the current set of estimates. Here the first element of the matrix is the adjustment for the mean, how much we adjust it and in which direction, and the second one is the adjustment for the standard deviation, how much we adjust sd compared to the mean. The first thing to note is that at convergence, when we have found the maximum of the likelihood, all first-order derivatives are typically zero, and then this is simply zero multiplied by something minus zero multiplied by something, which equals zero. So when we are at convergence, Newton's algorithm is not going anywhere, and we declare that to be the final result.
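Here is a minimal sketch of that step (my own illustration, with made-up numbers): the textbook 2-by-2 inverse, followed by one Newton update of the parameter vector:

```python
import numpy as np

def inv_2x2(H):
    # [[a, b], [c, d]]^-1 = (1/det) * [[d, -b], [-c, a]]:
    # swap the diagonal, negate the off-diagonal, divide by the determinant.
    a, b = H[0]
    c, d = H[1]
    det = a * d - b * c            # if det is (near) zero, Newton's method fails
    return np.array([[d, -b], [-c, a]]) / det

def newton_step(x, grad, hess):
    # x_new = x - H^-1 g
    return x - inv_2x2(hess) @ grad

x = np.array([0.0, 1.0])                       # current values of m and s
g = np.array([1.5, -0.5])                      # gradient at x (made-up numbers)
H = np.array([[-2.0, 0.3], [0.3, -1.0]])       # Hessian at x (made-up numbers)
print(newton_step(x, g, H))                    # the adjusted parameter values
```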
The idea of this derivative here is that if f is steep with respect to m, we adjust m more: if a small adjustment leads to a large increase in the likelihood function, then we should adjust m more than s. If adjusting m by one unit increases the likelihood by, say, 10, and adjusting s by one unit increases it by 2, then we should move more in the direction of m than the direction of s, because that is where we get the larger increase in the likelihood. If f curves heavily with respect to s, adjust m more; that's the meaning of the green part here. The idea is that even if m and s have the same value in the gradient, say the derivative with respect to m is one and the derivative with respect to s is one, it doesn't follow that we adjust s and m equally much. If the second-order derivative of s, which is typically negative when the function is concave, is large in magnitude, it means the derivative of s is going to decrease quite rapidly when we increase s, and that implies we should increase m more than s, because the gains from adjusting s rather than m are small. Similarly, if f is steep with respect to s, adjust m less: if we get a larger increase in the likelihood by moving in the direction of s than in the direction of m, then perhaps we should move more toward s. And if the slope of f with respect to m decreases with s, adjust m less: if we adjust s and m at the same time, a large cross second-order partial derivative will make the element of the gradient that corresponds to m decrease quite rapidly, so we shouldn't go too far along m. This is the meaning of the equation in Newton's method and how the second-order partial derivatives, the Hessian matrix, are used in Newton's technique.

Let's take a look at what convex and concave functions are in a two-parameter case. In a two-parameter case a function is convex if the second-order partial derivatives on the diagonal are positive or zero and this element here, which is the determinant of the Hessian, is also positive or zero. If you see problems related to determinants in your statistical software output, it typically refers to the determinant of the Hessian matrix: if the determinant is zero, or very close to zero, things basically fail in Newton's method because you run into division by zero. In strictly convex functions the diagonal elements and the determinant are always positive; in concave functions the diagonal elements are negative or zero; and in strictly concave functions the diagonal elements are always negative and the determinant is always positive.

What does this have to do with those errors about positive semi-definite and negative semi-definite matrices? If the function is concave, so it always curves down or is flat, then the Hessian matrix is negative semi-definite. The Hessian matrix tells us about the curvature of the function, and if the function is always straight or curves down, it's easy for the optimizer, because it can adjust using Newton's technique. When the computer tells us that the Hessian is not negative semi-definite, that indicates that the problem is challenging for the optimization algorithm; but if you still get a solution where the Hessian at the end is negative semi-definite, you're probably going to be fine. What is the meaning of these definitions? Positive definite, positive semi-definite, negative semi-definite, and negative definite are defined by this kind of matrix equation.
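The 2-by-2 conditions above can be written directly as a small check. This is my own sketch; for larger matrices you would test the signs of the eigenvalues instead, for example with np.linalg.eigvalsh:

```python
import numpy as np

def classify_hessian_2x2(H):
    # For a symmetric 2x2 Hessian [[a, b], [b, d]]:
    # negative definite <=> a < 0 and det > 0   (strictly concave function)
    # positive definite <=> a > 0 and det > 0   (strictly convex function)
    a, d = H[0, 0], H[1, 1]
    det = a * d - H[0, 1] * H[1, 0]
    if a < 0 and det > 0:
        return "negative definite"
    if a > 0 and det > 0:
        return "positive definite"
    if det < 0:
        return "indefinite (saddle)"
    return "semi-definite at best (determinant is zero)"

print(classify_hessian_2x2(np.array([[-1.0,  0.0], [ 0.0, -0.5]])))  # negative definite
print(classify_hessian_2x2(np.array([[-1.0, -1.0], [-1.0, -1.0]])))  # det = 0: a ridge
print(classify_hessian_2x2(np.array([[-1.0,  0.0], [ 0.0,  0.5]])))  # indefinite: saddle
```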
We have a vector z with the same number of rows as the Hessian matrix, and if we multiply the Hessian matrix on both sides by this arbitrary vector, the result z′Hz is always positive for positive definite matrices and always negative for negative definite matrices. What is the meaning of this particular equation? The meaning is that z basically tells us in which direction we go and how far. The idea of a vector, if we draw it on a two-dimensional plane or in three-dimensional space, is that it gives us a direction and a distance: it is basically one step. In maximum likelihood estimation we want the likelihood value to increase, and when we are at the maximum we want to be at a position where, whichever direction we go and regardless of how far, the value of the likelihood function will always decrease, because otherwise we are not at the maximum. So at the point where we are at the maximum, or what we think is the maximum, if the Hessian matrix is negative semi-definite there and the gradient vector is zero, then whichever direction we go the likelihood is always smaller, and therefore we know that the point where we are actually is the maximum of the likelihood.

If we look at the equation itself, the Hessian tells how much the derivatives are going to change if we change all the estimates by a little. When we multiply the Hessian by z once, we get basically the gradient vector: assuming the gradient is initially zero, which it is at convergence, multiplying the Hessian matrix by z gives the gradient after the step. If we then multiply the gradient by z again, we get the change in the likelihood when we move in the direction and distance indicated by z. So the idea is that whichever direction we go, the resulting gradient elements will be negative in the direction of travel, and therefore the change will always be negative: the likelihood decreases whichever way we go, and that is the consequence of the Hessian being negative semi-definite. That's what you see here: "Hessian is not negative semi-definite", and that's a problem, because Stata has concluded that the gradient is zero, so this is a possible maximum, but the Hessian is not negative semi-definite, so it is not guaranteed to be a maximum, and then it quits because it doesn't know where to go.

Let's take a look graphically at what it means for the Hessian, or the curvature of the likelihood, to be negative semi-definite or not. This is a fairly simple case: assume that the Hessian is constant. The Hessian contains three distinct values: on the diagonal we have the second-order derivatives with respect to m and s, and off the diagonal we have the cross partial derivative with respect to s and m, where the two off-diagonal elements are equal because the matrix is symmetric. If the off-diagonal element is always zero, then the derivative of m does not depend on the derivative of s. The values of m and s here don't have any particular meaning; we just know that the maximum, the point we are looking at, is at m equals zero and s equals zero. We can see here that the maximum is where m and s are both zero, and we know it is the maximum because the second-order partial derivatives are negative: from that point the surface curves down whichever way we go, and the curvatures along s and m are independent of one another because the cross second-order derivative is zero. So what happens if one of the second derivatives has a different value than the other?
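A quick numerical illustration of the z′Hz definition (my own sketch): for a negative definite Hessian the quadratic form is negative for every direction z, while for a saddle-type Hessian (discussed shortly) some directions give a positive value:

```python
import numpy as np

rng = np.random.default_rng(0)

def quadratic_form_signs(H, n_draws=10000):
    # Evaluate z' H z for many random directions z.
    Z = rng.normal(size=(n_draws, 2))
    vals = np.einsum("ij,jk,ik->i", Z, H, Z)
    return vals.min(), vals.max()

H_neg_def = np.array([[-1.0, 0.0], [0.0, -0.5]])
H_saddle  = np.array([[-1.0, 0.0], [0.0,  0.5]])
print(quadratic_form_signs(H_neg_def))  # max < 0: every direction goes downhill
print(quadratic_form_signs(H_saddle))   # min < 0 < max: some directions go uphill
```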
Let's assume that the second-order partial derivative with respect to s is minus 0.5 and the second-order partial derivative with respect to m is minus one. That just tells us that the curvature along m is a lot stronger than the curvature along s: the surface is flatter along s and steeper along m. It doesn't really matter what the magnitudes of those values are; this is still a maximum as long as the diagonal elements of the Hessian matrix, which contain these two second-order partial derivatives, are both negative.

A Hessian that is not negative definite can occur for two reasons. The simpler one to understand is that some elements on the diagonal are non-negative. Here the off-diagonal elements are still zero, but one of the elements on the diagonal, specifically the second-order partial derivative with respect to s, is zero. What that means is that the surface is flat along s, and if the gradient is zero at that point, then the value of the function will be the same regardless of which value of s we apply. All values of s produce the same, maximal value of the likelihood, and that kind of model is not identified, by definition: if you have multiple different sets of parameter values that produce the same value of the likelihood, that is the definition of under-identification. In practice, if you have zero diagonal elements, the data does not allow us to say anything about s, or at least it doesn't allow us to say what the one specific best value is.

The other case that can happen is that at some point one of the second-order partial derivatives is positive. This is rarely the case when you reach convergence, but it can happen during the estimation procedure, and we call this a saddle point. Saddle points are problematic for some estimation algorithms; for example, a textbook implementation of Newton's technique wouldn't necessarily be able to work at this kind of saddle point. The reason is that adjusting m cannot increase the likelihood, and you cannot determine whether you should increase s or decrease s, because either way you go you will get the same amount of increase: you can't say that going right is better than going left. In some cases statistical software developers have built workarounds for this kind of problem, for example trying a bit to the right and then a bit to the left and seeing where you end up, but the textbook Newton's method wouldn't be able to determine whether to increase or decrease s. It wouldn't be a convergence point either, because this is not a maximum: you can increase the likelihood by going in the positive or negative direction from s equals zero.

All right: for the Hessian to be negative definite and the function to be concave, the diagonal elements should be negative, and that's what you want at the convergence point. It guarantees that if you adjust s or m a little, the likelihood will be smaller, which means we are possibly at the maximum; we can't go anywhere to improve. There is also another condition, which relates to the off-diagonal elements, and we'll look at that next. Again, if one of these diagonal second-order partial derivatives is zero, that indicates a possible identification problem. Let's take a look at the role of the off-diagonal elements of the Hessian matrix.
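Both failure modes can be read off from the eigenvalues of the Hessian. This is my own sketch: a zero eigenvalue means a flat direction, a positive one means a saddle:

```python
import numpy as np

H_flat   = np.array([[-1.0, 0.0], [0.0, 0.0]])  # zero diagonal element: flat along s
H_saddle = np.array([[-1.0, 0.0], [0.0, 0.5]])  # positive diagonal element: saddle

for name, H in [("flat ridge", H_flat), ("saddle", H_saddle)]:
    vals = np.linalg.eigvalsh(H)                # eigenvalues of a symmetric matrix
    print(name, vals)
# flat ridge [-1.  0. ] -> negative semi-definite, not definite: s not identified
# saddle     [-1.  0.5] -> indefinite: not a maximum in the s direction
```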
Here's an example where the second-order partial derivatives with respect to s and with respect to m are both minus 1, but the cross partial derivative with respect to s and m is positive: less than 1, but positive. What that means is that if we take this point and go right, increasing m, that will decrease the likelihood, but it will also make the partial derivative of the function with respect to s positive, so when we then go up along s we actually increase the likelihood. The idea of this cross second-order partial derivative with respect to m and s is that the derivative of m changes when s changes, and the derivative of s changes when m changes. Originally the derivatives are zero in this example, so we are looking at a convergence point: when we go right we decrease the likelihood, because m has a negative second-order partial derivative, but when we go up we increase it again, because the derivative of f with respect to s became positive through this cross derivative. We can see that this is still a valid maximum point: the surface curves, it just doesn't curve evenly. And it doesn't matter what the sign of the off-diagonal element is; what matters is the magnitude. A value of 0.5 here is not sufficient to turn the derivative of one parameter positive when the other parameter is changed.

This, by contrast, is a Hessian that is not negative definite. Here all the elements are minus ones: the diagonal elements are minus ones, so we know the second-order derivatives with respect to s and with respect to m are both negative, and the surface curves down if we adjust just one of the variables. But if we adjust both variables at the same time, we can see that there is actually a ridge here: a combination of m and s values that produce the same likelihood. We can always increase s a little and keep the same value of the likelihood function if we at the same time decrease m a little. You can see the ridge here: all these combinations of parameters are equally good, the maximum is not unique, and this model is not identified, because we have multiple different combinations of parameter values that produce the same likelihood value. The idea is that if we go right, the likelihood decreases, because the second-order partial derivative of m is minus one, but if we then go down, the likelihood increases and ends up back at the original value, because the cross second-order partial derivative with respect to m and s cancels the effect. The point is that the absolute value of the off-diagonal elements is large enough to make the function flat, or even curve up, in some direction. Normally, when the off-diagonal elements are small in absolute value, the function viewed from the top looks like a rotated paraboloid; but if the off-diagonal elements are large, some parts of the paraboloid are lifted up, and it can become straight or even curve upward.

When could this kind of identification problem occur? For example, if we specify the mean of x as s plus m, assume the standard deviation of x to be one, and have only one sample mean, we are trying to estimate two quantities from one quantity. You can't estimate two quantities from one, so the model is not identified. The same thing applies here as before: the sign of the cross partial derivative, the off-diagonal elements of the Hessian, makes no difference, the magnitude does; the sign simply determines which direction the ridge goes.
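The ridge direction can be read off from the eigenvector whose eigenvalue is zero. Here is my own sketch, using the all-minus-ones Hessian from the example:

```python
import numpy as np

H_ridge = np.array([[-1.0, -1.0],
                    [-1.0, -1.0]])            # diagonal negative, but det = 0

vals, vecs = np.linalg.eigh(H_ridge)
print(vals)                                   # [-2.  0.]: one flat direction
flat = vecs[:, np.argmax(vals)]               # eigenvector with eigenvalue 0
print(flat)                                   # (up to sign) proportional to (1, -1)
# Moving along (dm, ds) proportional to (1, -1) leaves the likelihood unchanged:
# that is the ridge. In the "mean of x = s + m" example, only the sum m + s is
# identified, and the flat direction is exactly the one that holds m + s constant.
```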
Here we have a ridge going from negative m and negative s to positive m and positive s; in the previous case the ridge went from negative m to positive s. That is the 3D plot of the same problem. In some rare cases you can also have a saddle point: if the off-diagonal element is large enough in absolute terms, the surface actually curves up at some point, and the estimation algorithm would fail because it doesn't know whether to go left or right, since both increase the likelihood by the same amount.

To recap: there are convex and concave functions, and we want this criterion here, the determinant, to be always positive. The idea is that the product of the off-diagonal elements should always be less than the product of the diagonal elements, which themselves should be negative. This is of course for a 2x2 matrix; for larger matrices the math is more complicated, but the basic idea is the same: the off-diagonal elements should be small in absolute value compared to the diagonal elements. If this does not hold, if the off-diagonal elements are large enough to make the surface flat, then we have a potential identification problem.

So what should applied researchers do about this? What should you check in the Hessian, and when? It's important to check the last iteration. A message that the Hessian is not negative semi-definite and a message that the function is not concave are the same thing inspected from different angles: a function that is not concave always has a Hessian that is not negative semi-definite. You check the last iteration because it's possible that the optimizer had difficulties early in the optimization process but then completed the final iterations without problems. If at the final iteration you have a negative semi-definite Hessian, then you have found at least a local maximum. If you have problems, print out the gradient and the Hessian. The gradient should be all zeros, and in the Hessian you should inspect a couple of things. All the diagonal elements of the Hessian should be negative, which means the surface curves down if you move along any of the variables individually. Then you should check whether any of the diagonal elements are zero, which indicates that the model is not identified for that one particular parameter, and whether any of the off-diagonal elements are large in absolute value compared to the diagonal elements, which indicates that the model may not be identified for a pair of estimates. You can then think about those two estimates and how the model works, and that will probably help you figure out what the identification problem is.

One final thing to note is that some statistical software doesn't maximize the log-likelihood but instead minimizes the negative log-likelihood. In that case, instead of messages about concavity you would see messages about convexity, and a message that the Hessian is not positive semi-definite. So if your software sometimes tells you that the Hessian is not positive semi-definite, one reason could be that the library the software uses is written for minimization, and the computer has turned the maximization problem into a minimization problem by multiplying the log-likelihood by minus one.
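Putting this checklist into code, here is a hypothetical diagnostic helper of my own (not a function from any particular package) that you could run on the gradient and Hessian reported at your software's last iteration:

```python
import numpy as np

def inspect_convergence(grad, hess, tol=1e-6):
    """Checklist from the video: gradient zero, diagonal negative,
    off-diagonal small relative to the diagonal."""
    grad, hess = np.asarray(grad), np.asarray(hess)
    print("max |gradient|:", np.abs(grad).max())           # should be ~0
    diag = np.diag(hess)
    print("diagonal:", diag)                               # all should be negative
    for i in np.where(diag > -tol)[0]:
        print(f"  parameter {i}: zero/positive diagonal -> possibly not identified")
    # Scale off-diagonals like correlations: values near +-1 suggest a ridge.
    scale = np.sqrt(np.maximum(np.abs(diag), tol))
    ratio = hess / np.outer(scale, scale)
    for i in range(len(diag)):
        for j in range(i + 1, len(diag)):
            if abs(ratio[i, j]) > 0.95:
                print(f"  parameters {i},{j}: large off-diagonal -> possible ridge")
    print("eigenvalues:", np.linalg.eigvalsh(hess))        # all negative at a maximum

# Example: the unidentified "ridge" Hessian from earlier.
inspect_convergence([0.0, 0.0], [[-1.0, -1.0], [-1.0, -1.0]])
```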
That sign flip is something to keep in mind when you look at your own software's output.