The Hessian matrix is an important element of the numerical optimization techniques that we use to calculate maximum likelihood estimates. It is also something that can be useful to inspect when you have convergence problems. Let's take a look at what the Hessian matrix is and why it can be useful for diagnosing model non-convergence and, in particular, model identification issues.

The first time an applied researcher sees the term Hessian is probably in this context: your statistical software prints out warnings related to the Hessian, or the maximum likelihood output contains, in addition to the likelihood, some lines about the Hessian matrix. I have another video where I explain the Hessian matrix in more detail, but the idea is simply this. Understanding the Hessian matrix is easiest if we start from understanding what convergence means. Maximum likelihood estimation is like climbing to the top of a mountain, and we need a rule to determine whether we are on the top. One commonly used rule is that we are on the top when the ground below our feet is flat and curves down in all directions: we cannot take a step in any direction and go up anymore; we would always go down. The flatness, the slope, is stored in the gradient vector, which contains the first derivatives of the problem. The curvature is stored in the Hessian matrix, which contains the second partial derivatives of the problem.

When we visualize a Hessian matrix, it is useful to look at different surfaces. Here we have a likelihood surface for an estimation problem where we are trying to estimate the standard deviation and the mean of a normal distribution for some sample. One axis is the standard deviation, the other is the mean, and we can see that the likelihood reaches its maximum when the standard deviation and the mean take certain values. We are trying to find the values of the standard deviation and mean that make the likelihood of the data as large as possible. When we look at the same problem from a top-down perspective, like a map, we see contour lines. The map shows that there is just one peak: when we travel across the contour lines, we go up, and eventually we are on the peak. This is a well-behaved problem, and we would say that the Hessian for this problem is negative definite. That simply means that when we are on the top, the surface curves down in all directions.

Another example would be a Hessian matrix that is not negative definite. We might have a problem where the mean is identified but the standard deviation is not. For example, if there is no variance in the data, the standard deviation cannot be estimated, and if the sample size is one, we can calculate the mean but not the standard deviation. Then we have a ridge: all values of s are equally likely, and we can only estimate m. If we look at the surface from the top, the contour lines on the map are straight lines. We can go up, but there is a ridge along which all combinations of m and s are equally likely to have produced the observed data. So briefly, the Hessian matrix tells us in which directions the surface curves.
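To make the flat-ground-plus-curvature idea concrete, here is a minimal sketch in Python (an illustration of the math only, not the software used in the video) that evaluates the gradient and Hessian of the normal log-likelihood at the maximum likelihood estimates and checks that the Hessian is negative definite:

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(loc=2.0, scale=1.5, size=200)
    n = x.size

    # Maximum likelihood estimates of the mean (m) and standard deviation (s)
    m_hat = x.mean()
    s_hat = np.sqrt(np.mean((x - m_hat) ** 2))

    # Gradient of the log-likelihood at the MLE: the slopes of the surface
    grad = np.array([
        np.sum(x - m_hat) / s_hat**2,
        -n / s_hat + np.sum((x - m_hat) ** 2) / s_hat**3,
    ])

    # Hessian at the MLE: the second partial derivatives, i.e. the curvature
    h_ms = -2 * np.sum(x - m_hat) / s_hat**3
    hess = np.array([
        [-n / s_hat**2, h_ms],
        [h_ms, n / s_hat**2 - 3 * np.sum((x - m_hat) ** 2) / s_hat**4],
    ])

    print(grad)                     # both entries essentially zero: flat ground
    print(np.linalg.eigvals(hess))  # both negative: the surface curves down in
                                    # every direction, so the Hessian is
                                    # negative definite

With one observation, or with no variance in the data, the second eigenvalue would no longer be strictly negative, which is the ridge situation described above.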
Now let's go to a practical example. I'm going to estimate this model and then show how we can use the Hessian matrix to understand why it is not identified. We have two factors, f1 and f2. We estimate a model where f1 is regressed on f2, f2 is regressed on f1, and each factor has three indicators. The degrees of freedom are positive, but this model is not identified, because we have just one latent variable correlation, and as a general rule you cannot estimate two things, here two regression paths, from one correlation. Estimating these kinds of reciprocal paths generally requires instrumental variables, which are not available in this example.

Our data come from UCLA, from an example of how you can do an exploratory factor analysis using confirmatory factor analysis, but we will use the data set just to run a normal two-factor confirmatory factor analysis, or CFA. This is the syntax that I'll be using. It is a bit complicated, but I'll walk you through it line by line. We start by estimating this model from a hundred different sets of random starting values; the rnormal() here indicates that we draw a random starting value for each of these regression coefficients.

The first set of estimates is here, and importantly, there are no warnings. Everything looks okay, but the model is not identified, as I explained already. So you cannot always trust your software's identification checks. They are not bulletproof: they work most of the time, but sometimes problems like this go undetected. Still, if you know where to look, you might notice that these standard errors are very large, which indicates a problem with these two parameters, the two regression paths. Generally, in a linear model, identification problems almost always involve a trade-off between two parameters: if one parameter increases, the other must go down. And if you can fix the identification of one parameter, the identification of the other will also be established.

If we compare the first seven models estimated from these random starting values, we can see that all the factor loadings and all the indicator error variances are the same, so those parameters are identified. But the regression paths between the two latent variables vary depending on which starting values we apply, and that is an indication of model non-identification. We also note that the estimates are negatively correlated: whenever the regression path from F1 to F2 is large, the regression path from F2 to F1 is a large negative number, and if one is close to zero, the other is also close to zero. So this is an interesting feature: they are negatively correlated.

To understand more about how these estimates behave together, we save the estimates to a file and draw a scatter plot of the two estimates. Here is the scatter plot. There are a couple of cases where both estimates are very large, in the ballpark of 1,000. If we leave those out, we see a couple more where the estimates are very small, in the minus tens. If we take those out as well, then except for those four outlying runs, this is what the estimates look like: for this subset of runs, the two estimates are perfectly correlated, and they all happen to have the same likelihood value.
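The mechanics of this repeated-estimation check can be sketched in a few lines of Python. The toy model below is an assumption made for illustration, not the video's CFA: two regression coefficients b1 and b2 enter the likelihood only through their sum, so only b1 + b2 is identified, mimicking the ridge that the reciprocal paths produce:

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(42)
    x = rng.normal(size=100)
    y = 1.0 * x + rng.normal(size=100)   # the identified quantity is b1 + b2 = 1

    def neg_loglik(theta):
        b1, b2, log_s = theta
        resid = y - (b1 + b2) * x        # b1 and b2 enter only through their sum
        return np.sum(0.5 * (resid / np.exp(log_s)) ** 2 + log_s)

    # Re-estimate from 100 random sets of starting values, collect the estimates
    estimates = []
    for _ in range(100):
        start = np.array([rng.normal(), rng.normal(), 0.0])
        fit = minimize(neg_loglik, start, method="BFGS")
        estimates.append(fit.x[:2])
    estimates = np.array(estimates)

    # The solutions land on a ridge: b1 + b2 is always about 1, so b1 and b2
    # are (almost) perfectly negatively correlated across the runs
    print(np.corrcoef(estimates[:, 0], estimates[:, 1])[0, 1])  # close to -1
    # A scatter plot of the two columns would draw this ridge as a straight line.

The same pattern appears in the CFA: the identified parameters agree across runs, while the non-identified pair slides along the ridge.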
Returning to the actual estimates: the perfectly correlated solutions share the largest likelihood value that we can get with these data, and all the other likelihood values are smaller, more negative. In those runs the computer has not found the ridge. If we overlay the map with the ridge line, we can see what the geometry looks like: there is a ridge, so the Hessian is not negative definite. The other solutions are not on the ridge; their likelihood values are smaller than the values on the ridge, and the values along the ridge are all equally likely.

So how would we know, based on a single set of estimates, what the problem is? If we repeat the estimation from multiple different starting values and then plot the estimates, we get this kind of nice plot that illustrates the trade-off between the parameters. But that is a fair amount of work, and we can do it more easily. Let's look at these estimates again. Based on these estimates, how would we get more information about which parameters are not identified? We can look at the standard errors, but if there are, say, six parameters with large standard errors, we would not know which combination of those parameters is the problem; typically just two parameters are involved. What we can do is print out the variance-covariance matrix of the estimates, matrix list e(V). We can see that the regression paths from F1 to F2 and from F2 to F1 are almost perfectly negatively correlated. The matrix shows covariances, but when a covariance is in the same ballpark as the corresponding variances, we know that the correlation must be very close to plus or minus one. And indeed, if we convert these to correlations, we see an almost perfect negative correlation between the two estimates, which is what we saw in the plot; in principle the correlation is perfect, and numerically it comes out as almost perfect.

So you can spot where the identification problem is by printing out the variance-covariance matrix of the estimates and checking whether any estimates are almost perfectly correlated, either positively or negatively. If a pair of estimates is perfectly correlated, then those two parameters cannot both be estimated from the same data. You have to adjust the model or, if it is an empirical underidentification case, which is a sample-specific problem, collect more data.

Sometimes the problem is so severe that you do not get estimates at all, and without estimates you cannot calculate the variance-covariance matrix of the estimates. What do you do then? You can print out the gradient vector and the Hessian matrix. This is the printout. It looks pretty intimidating for a beginner, but the gradient simply contains the slopes: is the ground flat under our feet at the presumed maximum? All these values should be close to zero; if one is not, you have a problem with that specific parameter. In practice, when the computer declares convergence or stops estimating, the gradient values are pretty much always close to zero, so finding problems here is not that common. Then we have the Hessian matrix, which tells us about the curvature, and we interpret it in the same way we interpret the variance-covariance matrix of the estimates.
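Here is a sketch of the variance-covariance check described above (the helper function and the numbers are made up for illustration; in the video the matrix itself is the one printed by the software, e(V)):

    import numpy as np

    def flag_dependent_estimates(vcov, names, threshold=0.98):
        """Convert a variance-covariance matrix of estimates to correlations
        and print any pair of parameters whose estimates are almost perfectly
        correlated, positively or negatively."""
        sd = np.sqrt(np.diag(vcov))
        corr = vcov / np.outer(sd, sd)
        for i in range(len(names)):
            for j in range(i + 1, len(names)):
                if abs(corr[i, j]) > threshold:
                    print(f"{names[i]} <-> {names[j]}: "
                          f"correlation {corr[i, j]:+.3f}")

    # Toy usage: the two regression paths trade off almost one-to-one
    vcov = np.array([[4.00, -3.96],
                     [-3.96, 4.00]])
    flag_dependent_estimates(vcov, ["F1<-F2", "F2<-F1"])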
So we would be looking at off-diagonal values that are large in absolute value compared to the diagonal elements. The diagonal elements are almost always negative numbers; a non-negative diagonal element indicates an identification problem that is specific to one parameter. But most of the time you look at the off-diagonal elements and ask whether there are any large values. We have a large value here: it is comparable in magnitude to the two corresponding diagonal elements, and that indicates a problem involving those specific parameters (a small sketch of this check appears at the end of this section). I have a video about the Hessian matrix where I discuss this interpretation in a lot more detail, but this is a demonstration that you can use the Hessian matrix and the variance-covariance matrix of the estimates to detect this kind of linear dependence between two parameter estimates, which makes the model not identified.

Looking at the bigger picture of convergence problems, the Hessian matrix and the variance-covariance matrix of the estimates are useful for diagnosing identification problems. They are not that useful for troubleshooting computational issues, for which experimenting with starting values works better. Some computational issues might be diagnosed with the Hessian matrix, but typically the kind of problem you find with this technique is an identification problem, which you can only address by adjusting your model or by collecting more data.
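As promised above, here is the corresponding check applied directly to the Hessian, for the case where estimation fails before e(V) is available (again a hypothetical helper with made-up numbers, sketching the interpretation rather than any particular software's output):

    import numpy as np

    def diagnose_hessian(hessian, names, threshold=0.95):
        d = np.diag(hessian)
        # A non-negative diagonal element points at one specific parameter
        for name, dk in zip(names, d):
            if dk >= 0:
                print(f"{name}: non-negative diagonal element ({dk:.3g})")
        # Off-diagonal elements comparable in size to the corresponding
        # diagonal elements point at a pair of linearly dependent parameters
        scale = np.sqrt(np.abs(d))
        scaled = hessian / np.outer(scale, scale)
        for i in range(len(names)):
            for j in range(i + 1, len(names)):
                if abs(scaled[i, j]) > threshold:
                    print(f"{names[i]} <-> {names[j]}: "
                          f"scaled off-diagonal {scaled[i, j]:+.3f}")

    # Toy usage: the off-diagonal is comparable to the diagonals,
    # which flags the reciprocal paths as a dependent pair
    h = np.array([[-10.0, 9.8],
                  [9.8, -10.0]])
    diagnose_hessian(h, ["F1<-F2", "F2<-F1"])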