Numerical integration techniques are sometimes required for calculating maximum likelihood estimates of statistical models. This is a somewhat technical topic, but there are a couple of practical reasons why understanding what numerical integration is is useful for an applied researcher. Also, if you are a really advanced applied researcher, you might sometimes find that your statistical software does not have an estimation routine for the kind of model you are interested in and you want to program your own estimator; in that case understanding numerical integration becomes useful. Let's take a look at a couple of examples of why an applied researcher should care. The first time I encountered numerical integration was in the context of non-linear structural equation models. Here we are estimating a fairly simple model: we have four latent variables with a couple of indicators each, and we estimate an ordered probit model. It takes forever to estimate; it runs for maybe hours on my work computer. That's one way you might encounter numerical integration. Another, less common, way you might encounter numerical integration is that you get weird results. You are running a statistical model, then you reorder your data, or you restart your computer, reload the data set and forget to reorder the data, and from the same data you get two different sets of results depending on how the data are ordered. This happens because of the imprecision of certain numerical integration techniques.

Let's take a look at why we use numerical integration, then what numerical integration does, and then how we can make numerical integration work a bit faster in some scenarios. Normally when we estimate latent variable models (let's use the factor analysis model here as an example, but this applies to all latent variable models, including random effects in multilevel models), we work with something called a model-implied covariance matrix. Because this is a just-identified model, we would estimate it by setting the model-implied covariance matrix equal to the observed covariance matrix and then just solving the equations. If we have an over-identified model, the estimation criterion is that we adjust the model parameters so that the model-implied covariance matrix and the observed covariance matrix are as close to one another as possible, and the closeness is measured using the maximum likelihood criterion. This works really well; it is quick and it is a useful technique for most applications, but it has some boundary conditions, so it does not always work. The first obvious limitation is that if we have very many observed variables, this technique becomes impractical. Say we have a thousand indicators in our factor model: a 1000 by 1000 covariance matrix takes a lot of memory and a lot of computation time for the matrix calculations. This might sound like something you would never face, but it is actually fairly common. Consider multilevel modeling and nested data.
So if you want an industry effect in a multilevel model, you might have something like a 20-year panel, so 20 observations for each company, and then you might have 50 companies in each industry, so 1000 observations per industry. To estimate an industry effect, which you don't observe, you would have to fit a latent variable model, and you might calculate this kind of huge covariance matrix with the covariances between all the observations; that would give you your industry effect. But that is not feasible because of the size of the matrix, so we need different techniques. Another boundary condition is nonlinear models. Assume that instead of a linear model you have a probit model for these indicators. We can no longer use covariances, because covariances quantify linear relationships and this is a nonlinear model. So for nonlinear models we need alternative techniques; we can no longer rely on covariances because they are fundamentally linear. The third case is interactions. We have a latent variable here and an observed variable A here, but we can't interact them; we can't multiply those variables together because we don't have values for the latent variable. And we can't use covariances either, because now we have a three-variable relationship, M, A1, and the latent variable, and a covariance is always between two variables. Covariances are linear relations between two variables, so this could not be estimated from a covariance matrix.

Let's take another example of how numerical integration works and how it affects the results. I am estimating the same model using Stata's sem command and then using Stata's gsem, generalized structural equation modeling. We estimate the same model: we have 10,000 observations, a large sample size just to make a point, and one factor with three indicators, so it is just identified. The sem model is very easy to estimate, but there are a couple of interesting things we will find. I am also using the timer command so we can compare how long it takes to run the sem model versus the gsem estimation. The sem results are here: we have the useful stuff, we have the estimates, we have the chi-square test, and it is zero because this is a just-identified model. When we look at the gsem results, we can see that the estimates and the standard errors are the same and the likelihoods are the same, but there are a couple of interesting things to note. Looking at the timers, we see that gsem is about 10 times slower than the normal sem for this simple model, and we also note that there is no chi-square statistic for gsem. The chi-square statistic is not very useful in this case because the model is saturated and cannot really be tested, but nevertheless we don't get the chi-square; otherwise the results are identical. So what explains the difference? If we look at how these models are estimated, these two commands actually use very different ways to calculate the likelihood. When we look at the sem documentation, the equations part, the technical explanation tells us that we have the parameter vector and we make the log likelihood as large as possible, and the log likelihood involves the observed covariance matrix and something that looks like that, the model-implied covariance matrix, and we make them as close to one another as possible.
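For reference, this is the standard normal-theory fitting function for covariance structure models (the textbook form, not copied from the Stata manual); minimizing it is equivalent to maximizing the log likelihood that sem reports:

```latex
F_{ML}(\theta) \;=\; \ln\lvert\Sigma(\theta)\rvert \;+\; \operatorname{tr}\!\left(S\,\Sigma(\theta)^{-1}\right) \;-\; \ln\lvert S\rvert \;-\; p
```

Here S is the observed sample covariance matrix, Sigma(theta) is the model-implied covariance matrix, and p is the number of observed variables. The point is simply that the data enter only through S.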
Then when we take a look at the gsem documentation, the same section shows something quite different: there is an integration symbol, and there is no covariance matrix of the data. What we are actually doing is that we are no longer comparing the observed covariances against the implied covariances and calculating the likelihood of getting one sample covariance matrix from an estimated population covariance matrix; instead we are looking at observation-level likelihoods. We calculate the likelihood of each observation, or cluster of observations depending on the model, and then those likelihoods are multiplied together, and that gives us the full likelihood.

So how does it actually work? How do observation-level likelihoods work? Take a look at this data: we have four groups of four data points each, and the ID variable indicates the group. In our regression model we want to regress Y, the dependent variable, on ID and X. If we just run a normal regression analysis using maximum likelihood estimation, we calculate the fitted value, which is simply the weighted sum of ID and X, using the regression coefficients as weights. Then we estimate how likely we are to get this Y of 0.6, for example, from a normal distribution with a mean of 2 and a variance of 0.7, and the log likelihood for that observation would be minus 2.2. This is fairly simple if you know how to do it: we calculate the fitted value, we estimate the variance of the error term, and we check how likely we would be to get the observation we just got from a normal distribution centered at the fitted value and having the variance of the error term. We multiply all those likelihoods together, or sum the log likelihoods, and that gives us the total likelihood or the total log likelihood, depending on whether we are working with likelihoods or logs.

Now, problems begin when we don't observe the ID. We assume there is some kind of group-level effect, but we don't have values for it; it might be unobserved heterogeneity in econometrics, or a latent variable, a factor, in factor analysis, but we don't have values for the ID. So what do we do? Well, we can make an assumption that the ID is normally distributed. We pretty much always assume that continuous unobserved or latent variables are normal, because that simplifies the calculations, and we need to make some kind of distributional assumption anyway to calculate the maximum likelihood estimates. We can also work with categorical latent variables, but for now let's focus on the normal case. Let's take a closer look at the observation-level likelihood: how would we calculate this log likelihood for the first observation if we don't know the value of the cluster or group-level effect ID? What we actually do here (this is from the Stata documentation) is calculate the likelihood of an observation for a specific set of latent variable values u. So we assume that the latent variable has a specific value, say 0, and we calculate the likelihood; then we have the probability density of that set of latent variable values u, so we have the probability density of the latent variable, we have the likelihood calculated at a specific value, and we multiply them together.
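Before moving on, it may help to see the simpler, fully observed case from a moment ago as code. This is a minimal Python sketch; the coefficient values (both set to 1 here) and the single observation are stand-ins chosen only to reproduce a fitted value of 2 and a residual variance of 0.7, not the actual example data.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical regression: y = b_id * id + b_x * x + error, error ~ N(0, sigma2)
b_id, b_x = 1.0, 1.0                 # assumed coefficients, for illustration only
sigma2 = 0.7                         # assumed residual variance

id_val, x_val, y_val = 1.0, 1.0, 0.6 # one observation

fitted = b_id * id_val + b_x * x_val # fitted value (here 2.0)

# Log likelihood of observing y = 0.6 from N(fitted, sigma2)
loglik = norm.logpdf(y_val, loc=fitted, scale=np.sqrt(sigma2))
print(loglik)                        # about -2.1 with these rounded inputs,
                                     # close to the -2.2 quoted in the example

# The sample log likelihood is the sum of these over all observations.
```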
So what is the point? The point is that when we estimate the model, we calculate the likelihood by evaluating the likelihood at possible values of the unobserved variable u and taking a weighted average. If we have a latent variable with an estimated standard deviation of 1 and an estimated mean of 0 (typically the mean is fixed at 0), then we give the likelihood calculated at the latent variable value 0 a lot more weight than the likelihood calculated at the latent variable value of minus 4, because minus 4 would be a very unlikely draw from a standard normal distribution. In practice we don't calculate every possible value, because on a normal distribution that would take forever; there are infinitely many different values. What we do instead is apply numerical integration techniques.

Now we get to the question of what numerical integration is. You might remember from high school that what integration does is calculate the area under a curve, and numerical integration is a way to numerically approximate that area. This is a normal distribution; because it is a probability density, the area is 1 by definition, but we can still use it as an example. One way to estimate the area under the curve is to approximate it with a histogram: we calculate the height of the curve at five points, x = -4, x = -2, x = 0, x = 2, and x = 4, we give each point an equal weight, and the function values multiplied by the weights and summed give us the area. It is off by 1.4%, but pretty close even with this coarse approximation. These x values are called integration points, so we have five integration points here. We can make the integral more precise by increasing the number of points: for example, with 10 points we would have this kind of histogram and already get a very precise estimate, and with 100 points we would be pretty well equipped to approximate almost any kind of curve. Normally when you start learning about numerical integration you use equal weights: the bars here are equally spaced and each has a weight of 0.1, and previously we had a weight of 1 or a weight of 2. But the general strategy is that we don't use equal weights. Instead we strategically choose a few x values, 5, 7, sometimes 12 or 16 of them, calculate the function value at those x values, and then weight them differentially, because we are strategic in our choice of x and are not just taking equally spaced values.

Let's look at a couple of examples. If we want to estimate the mean of a standard normal variable, the standard normal density is multiplied by x and we want to know what the mean of x is. The idea of calculating a mean is that you sum over all possible values of x; of course we don't have all possible values, so we take the weighted sum of the probability density multiplied by x, and we take the area under that curve. Here we have a negative part, so this is negative something and this is positive something, and they cancel out because the curve is symmetric; the area under the curve is 0, which is the mean of a standard normal variable.
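Here is a minimal Python sketch of this rectangle-style approximation; the grids and weights are my own choices for illustration, matching the five-point example above.

```python
import numpy as np
from scipy.stats import norm

def rectangle_integral(f, points, weight):
    """Approximate an integral as the sum of f(x) * weight over equally spaced points."""
    return sum(f(x) * weight for x in points)

# Area under the standard normal density with 5 equally spaced points, weight 2 each
points5 = [-4, -2, 0, 2, 4]
print(rectangle_integral(norm.pdf, points5, weight=2.0))   # about 1.014, off by roughly 1.4%

# More points give a better approximation
points100 = np.arange(-5, 5, 0.1)
print(rectangle_integral(norm.pdf, points100, weight=0.1)) # very close to 1

# The same idea gives the mean of a standard normal: integrate x * density(x)
print(rectangle_integral(lambda x: x * norm.pdf(x), points100, weight=0.1))  # essentially 0
```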
This red curve here is called the prior distribution, and understanding the prior distribution, and that this blue curve is the posterior distribution, is useful when you read about, for example, Bayesian analysis, or when you read about how to solve problems with numerical integration, because some of those problems involve the shape of the posterior distribution. For now, the prior distribution is the distribution over which you integrate, so we are integrating over the normal distribution, and the posterior distribution is the function value times the prior distribution. If we want to estimate the variance of a normal distribution, we calculate the average squared deviation from the mean; the same principle applies. We take the deviation from the mean u, which is 0 by definition here but could also be estimated, we calculate (x - u) squared, and then we multiply by the probability density. We get these two peaks, and the area under this blue curve is 1, which is the variance of a standard normal variable.

In practice, the choice of these integration points is critical; it determines how well the integral works. Fortunately, there are a couple of really good ways of choosing the integration points. When we work with normally distributed variables, we almost always apply something called the Gauss-Hermite quadrature rule. I won't even try to explain how it is derived, but the idea is that we have weights and we have what are called abscissas in the literature (I just call them x values because that is more familiar to me), and we calculate the value of the function at these strategically chosen x values, weight those function values, and take a sum. How do we know the x values and the weights? Well, we can just look them up in a table; many books about statistical estimation provide these tables. Alternatively, we can use an online calculator, or a statistical program normally has a calculator that gives us these weights and x values when we integrate over a normal distribution. There is something we need to understand about this Gauss-Hermite quadrature approach: it calculates integrals of the form e to the power of minus x squared. If we look at the normal density here, it is pretty close to e to the power of minus x squared, except that we have some extra stuff: we subtract the mean and divide by two times the variance. So we need to do something called a change of variable. Instead of integrating directly over the y value, we integrate over x and define y as a function of x; we plug that in, and that gives us the integral. This same thing is explained in the Stata user manual, where the Stata developers tell you that you take the quadrature points from the Gauss-Hermite rule, and then the actual x values, or abscissas, and the weights are calculated using those equations. So we have the likelihood function, then we have the normal density, and in the general case we would calculate the abscissas like so and the weights like so for a normally distributed variable centered at zero. Most of the time our variables are centered at zero; if they are not, we can easily re-specify the problem a bit so that the latent variable is centered at zero and then just include an intercept in the model.
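To see what Gauss-Hermite quadrature with this change of variable looks like in practice, here is a small Python sketch using NumPy's built-in Hermite-Gauss nodes; the expectations computed (mean and variance of a standard normal) mirror the examples above, and the function and parameter names are mine.

```python
import numpy as np

def normal_expectation(g, mu=0.0, sigma=1.0, n_points=5):
    """Approximate E[g(Y)] for Y ~ N(mu, sigma^2) with Gauss-Hermite quadrature.

    hermgauss gives nodes and weights for integrals of the form
    integral of f(x) * exp(-x^2) dx, so we apply the change of
    variable y = mu + sqrt(2) * sigma * x and divide by sqrt(pi).
    """
    x, w = np.polynomial.hermite.hermgauss(n_points)
    y = mu + np.sqrt(2.0) * sigma * x
    return np.sum(w * g(y)) / np.sqrt(np.pi)

# Mean and variance of a standard normal, recovered with only 5 strategically chosen points
print(normal_expectation(lambda y: y))        # about 0
print(normal_expectation(lambda y: y ** 2))   # about 1
```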
This works remarkably well for many problems, but there are cases where the Gauss-Hermite quadrature approach does not work well, and such a case is shown here: we have five integration points, shown in the first figure, but the posterior peak, the mass of the posterior, happens to fall right between two integration points, and then the integral can be seriously misleading or seriously incorrect. What we do then is that instead of using the original Gauss-Hermite quadrature rule for integration, we use something called mean-variance adaptive quadrature. There are other variants of adaptive quadrature beyond mean-variance, but mean-variance is the default in Stata, for example. Instead of using the original prior distribution to calculate the weights and abscissas, we look at the posterior distribution: we estimate the mean and standard deviation of the posterior distribution, and then we choose the weights and abscissas based on the mean and variance of the posterior instead of the prior.

Okay, so that's a bit of theory. If you search for quadrature in the Stata user manual, you will probably find at least a thousand hits across all the manuals, so this is used all over the place in Stata, and it is also used in many other statistical software packages as well. So how is numerical integration related to estimating latent variable models? Let's return to our example of calculating these observation-level likelihoods; how would we actually proceed in estimating the likelihoods in this case? Instead of calculating the likelihood at every possible value of the group-level effect, we proceed one group at a time, and let's use three integration points. So we take three points: one point in the middle, at 0, which receives the most weight because it is the most likely value to get from that distribution, and then one value from the left tail and one value from the right tail with smaller weights. If we run this with Stata with three integration points, we get some coefficients; we don't get any coefficients for the latent variable itself, but we get the estimate of its variance component in the model, and we get a log likelihood of minus 23. So how does it actually work? Stata uses mean-variance adaptive Gauss-Hermite quadrature here, but we are going to use the non-adaptive version because it is a lot simpler, so the likelihood will not be exactly the same, but in this case it will be close. So how does the computation work?
Well, let's take a look at the first cluster. We start by calculating the likelihood assuming that the latent variable is in the left tail, so the value of the abscissa would be minus 1.59 and the value of the weight would be 0.17. We calculate the fitted value using this model: we enter x times 0.86 plus a1 plus the constant, and that gives us the fitted value for the first observation. Then we center a normal distribution on that fitted value, we take the variance of the error term, and we estimate how likely we would be to get the value of y of 0.60 from a normal distribution centered at 1.72 with a variance of 0.75. That gives us the likelihood (we are actually working with likelihoods here, not log likelihoods), and the total likelihood for this cluster of four observations at this latent variable value is the product of all these likelihoods. Why we work with likelihoods instead of log likelihoods will become clear in a moment. Then we repeat the same for another value of the latent variable: the second value we try is 0, and it receives more weight, 0.67 compared to 0.17, because it is a more likely value from that distribution. Then we go to the right tail, abscissa 1.59; this is symmetric, so first we tried minus 1.59 and now we do 1.59, with weight 0.17. We calculate the likelihoods for each observation and multiply them together to get the cluster-level likelihood.

Now, the total likelihood, shown on this fifth row that I have added to the data, is the weighted sum of these individual likelihoods. So we take the likelihood of this cluster of four observations if the latent variable is in the left tail, which is 0.013, and multiply it by the weight 0.17; if the latent variable is in the center of the distribution, the likelihood is 0.02, and we multiply that by 0.67; and the third likelihood is multiplied by the third weight. The weighted sum is our likelihood, and then we can take the log, which gives us the log likelihood. We do this for every group: we integrate once over the distribution of the latent variable for each group, one group at a time, and that gives us the likelihood of each group. We don't actually have likelihoods for individual observations, because the observations are not independent; because they are not independent, the likelihood of one observation depends on another, so we cannot state the likelihood of one observation independently of the others, and we can only consider the likelihood at the cluster or group level. If we combine these log likelihoods, we get a likelihood that is pretty close to the likelihood Stata gives. It is not exactly the same, because this is the non-adaptive version of the quadrature rule, and Stata uses mean-variance adaptive quadrature because it performs a lot better in practice; it avoids the problems we can run into with the basic quadrature rule. So this is the single latent variable case.
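Here is a rough Python sketch of that computation for a single cluster using non-adaptive Gauss-Hermite quadrature. The data values, coefficients, and variances are placeholders rather than the actual numbers from the example, so the abscissas and the resulting likelihood will differ.

```python
import numpy as np
from scipy.stats import norm

# Placeholder data for one cluster of 4 observations (not the example data set)
x = np.array([1.0, 0.2, -0.5, 1.3])
y = np.array([0.6, 1.1, 0.4, 2.0])

# Placeholder parameters: slope, intercept, residual variance, latent-variable variance
b_x, b0, sigma2_e, sigma2_u = 0.86, 0.0, 0.75, 1.0

# 3-point Gauss-Hermite nodes/weights, transformed for a N(0, sigma2_u) latent variable
nodes, weights = np.polynomial.hermite.hermgauss(3)
u_values = np.sqrt(2.0 * sigma2_u) * nodes    # abscissas on the latent variable's scale
w = weights / np.sqrt(np.pi)                  # weights that sum to 1 (about 0.17, 0.67, 0.17)

cluster_lik = 0.0
for u, wt in zip(u_values, w):
    fitted = b0 + b_x * x + u                 # fitted values given this latent value
    # Likelihood of the whole cluster at this latent value: product over its observations
    lik_given_u = np.prod(norm.pdf(y, loc=fitted, scale=np.sqrt(sigma2_e)))
    cluster_lik += wt * lik_given_u           # weighted sum over the latent values

cluster_loglik = np.log(cluster_lik)          # take the log only after integrating
print(cluster_loglik)
```

Note that the log is taken only after the weighted sum over the latent variable values, which is why the computation works with likelihoods rather than log likelihoods within a cluster.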
What if we have correlated latent variables? If we have correlated latent variables, we need to construct a nested integral. The idea is that we first need to somehow make the latent variables uncorrelated. Remember that the idea of calculating the likelihood of a model with a latent variable is to integrate; if we have two latent variables, we have an integral that goes over the first variable, and the likelihood that we integrate over contains another integral, so with two latent variables we have two integrals. We call these nested integrals, and this works if the latent variables are uncorrelated. But most of the time the latent variables in our models are correlated, and we want to estimate that correlation, so we use something called the Cholesky decomposition. You don't have to understand how it works (I don't), but it is a technique we can use to get correlated variables out of uncorrelated ones, and you will see the term quite often in models that apply numerical integration. What the Cholesky decomposition does (this is again an example using this data) is that we have a matrix of uncorrelated latent variables, Cx, our uncorrelated latent variable covariance matrix, and Cy, the estimated latent variable correlation matrix; we want the estimated correlation matrix to look like that, but we have to integrate over that matrix. So we first generate values from the uncorrelated latent variable covariance matrix; then the Cholesky decomposition of the correlation matrix gives us something called a projection matrix, and we can project that uncorrelated set of latent variables onto a correlated one. In practice we would take the latent variable values for the first group and multiply them by the projection matrix to get a correlated set of latent variables, and that gives us the correlation structure we want. This is also the technique we use when we want to generate correlated random numbers: a computer first generates uncorrelated numbers, then applies the Cholesky decomposition to the desired correlation matrix, and you get correlated numbers. So this is the list of the X variables, which are uncorrelated, and then we get these Y variables: we have uncorrelated X variables, and when we apply the projection matrix we get correlated Y variables.

So what is the relevance to numerical integration? The nested integration works when the latent variables are uncorrelated, so we integrate over uncorrelated latent variables and then project them into correlated ones using the Cholesky decomposition, and that gives us the value of the likelihood assuming the latent variables are correlated. This is a bit of a technical thing, but what you need to know is that when you see the words Cholesky decomposition in this context, it just relates to getting correlated variables out of uncorrelated ones. So this is the adaptive quadrature way of integrating: you choose a few points strategically, you calculate the function values at those points, you calculate the weighted sum of the function values, and if you have multiple latent variables, you integrate over uncorrelated latent variables and recast them as correlated ones using the Cholesky decomposition, which gives you the value of the likelihood.
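A small Python sketch of this Cholesky projection step; the correlation matrix here is an arbitrary illustration, not the one estimated in the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Target correlation matrix for two latent variables (illustrative value)
R = np.array([[1.0, 0.5],
              [0.5, 1.0]])

# Uncorrelated draws for the two latent variables
x = rng.standard_normal(size=(10_000, 2))

# The Cholesky factor acts as the "projection matrix": y = x @ L.T has correlation close to R
L = np.linalg.cholesky(R)
y = x @ L.T

print(np.corrcoef(x, rowvar=False).round(2))   # close to the identity matrix
print(np.corrcoef(y, rowvar=False).round(2))   # close to R
```

In the estimation context the same projection is applied to the quadrature abscissas rather than to random draws, so the nested integral itself is still taken over uncorrelated variables.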
There is another way that is sometimes used for calculating integrals, called Monte Carlo integration. The idea of Monte Carlo integration is that instead of choosing a few values of these latent variables strategically, we draw a lot of values: we just take random numbers from the normal distribution, for example 100, we evaluate the likelihood for each value, and then we take the mean. For example, if we have a normally distributed variable and we want to know what the mean of the square of that normal variable is, we would simulate observations from the normal distribution and calculate the square of each observation. Here the dashed line shows how the mean develops: when we increase the number of observations, the number of replications, to 40, 45, 50, we can see that the mean gets closer and closer to one, which is the correct value, and with about 1000 iterations we would get very close to the correct value; here we have 500 iterations and we are already very close to one. So the idea of Monte Carlo integration is that instead of calculating strategically chosen points and weighting them, we take a lot of points, with most points close to the mean and fewer points in the tails, calculate the likelihood value at each, and take the mean. This is a lot easier to understand than the quadrature approach.

So which one should be applied? For simple problems the Gauss-Hermite quadrature is superior: it produces more precise integrals and it is a lot faster, because you might need just, say, 5 or 7 or 12 integration points, whereas the Monte Carlo technique typically requires hundreds of points. But there are scenarios where Monte Carlo techniques are useful. In practice, as the analytical technique, we apply adaptive quadrature nowadays, because the basic version of Gauss-Hermite does not work well for all problems, and mean-variance adaptive quadrature is the default in many statistical software packages because it is more robust: we choose the weights and abscissas based on the posterior mean and variance instead of based on the prior normal distribution. There are also others: Gauss-Kronrod is another technique that is sometimes used (I don't know what it does, but the principle is the same: you calculate a weighted sum). One interesting alternative technique is the Laplace approximation. The Laplace approximation is not really an integration technique; it is an approximation of the integral, but what is really useful about it is that it uses only one integration point, and if you only use one integration point, computation is a lot faster. This becomes a real issue when you have a high-dimensional integral. If you have only one latent variable in a model and, say, five integration points, that is five evaluations; with two latent variables, five times five is 25 integration points; with three latent variables we are already looking at 125 integration points, because the integrals are nested, so you need to evaluate each integral multiple times and each integral has five points; and with four latent variables we have 625 integration points. So this quickly gets out of hand when you have a big model. An advantage of Monte Carlo integration is that, while it can be imprecise if the number of replications R is small, the user has more direct control over computation time: if you decide that you want 1,000 replications, then you are going to be drawing 1,000 values of the latent variables regardless of the complexity of the problem. This is because you don't do nested integration; instead, if you have four latent variables, you draw samples from a four-dimensional multivariate normal distribution, so you get all the latent variable values in one draw. And if computation time becomes an issue, you can either decrease the number of integration points or apply the Laplace approximation, which scales a lot better for complex problems than these quadrature techniques.
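Here is a minimal Python sketch of the Monte Carlo idea described above, estimating the mean of the square of a standard normal variable (the true value is 1). The seed and the numbers of draws are my own choices; the point is that the number of draws is set directly by the user.

```python
import numpy as np

rng = np.random.default_rng(42)

def mc_mean_of_square(n_draws):
    """Monte Carlo estimate of E[Z^2] for Z ~ N(0, 1); the true value is 1."""
    z = rng.standard_normal(n_draws)
    return np.mean(z ** 2)

for n in (50, 500, 5_000, 50_000):
    print(n, mc_mean_of_square(n))   # estimates wander toward 1 as the number of draws grows

# Unlike nested quadrature, whose cost grows like points ** n_latent_variables,
# the number of draws here is chosen directly, even with several latent variables
# drawn jointly from a multivariate normal distribution.
```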
But one thing that you should know is that if you start to simplify the integration, you lose precision, so it is a good idea to always increase the number of quadrature points when you are running the final model. Stata, for example, has a special command, quadchk, for checking whether your quadrature approximation is good or not. So always re-estimate the final model with more integration points; but if you are just developing your model, then using, for example, the Laplace approximation just to see quickly what the results look like is sometimes a good idea.