Hi, this is Dr. Justin Esarey, and this is week 10 of POLS 509, The Linear Model. This week we're going to talk about hierarchical linear modeling, which is a continuation of a topic we started last week: how to deal with panel or time series cross-section data. Hierarchical linear modeling is one approach for dealing with a problem that exists in TSCS data, unit heterogeneity. As you remember, the form of unit heterogeneity we discussed last week is a case where each unit (every country or every state, or maybe every person if you're observing people over multiple time periods) has a different base average of the dependent variable, some kind of differential intercept. So this is a case where you've got some X variable and some Y variable, and you're looking at different units; each unit has the same slope, the same relationship between Y and X, but their intercepts are quite different. That's just a way of saying that their average levels of Y, their Y-bars, are different due to unobserved characteristics of each unit. That's one form of unit heterogeneity, but it's not the only form. Some forms of unit heterogeneity are much more complex than this one, and hierarchical linear models are a way of dealing with some of the more complicated versions of unit heterogeneity that can exist in panel data. As we're going to discuss, HLM is a form of random effects model, which means that it's not necessarily suitable for every conceivable kind of problem you can encounter in TSCS data. It's useful for a wide variety of these problems, but not all. So I want to start off this week with a little review of some of the simpler versions of unit heterogeneity that can exist in time series cross-section data, and the key characteristic that separates the cases where a random effects model is appropriate from the cases where a fixed effects model is. Then we're going to use that to lead into a discussion of when, given that HLM is a form of random effects model, HLM is a tool we might want to use to address some problem we perceive in TSCS data. With that preview in mind, let's get started. As we discussed last week, unit heterogeneity can have different consequences. We'll start with the basic form: each unit in a panel or time series cross-section data set has a different intercept. Ignoring that form of unit heterogeneity, just not modeling it at all, is equivalent to a form of omitted variable bias. In essence, we're saying each one of our panels has a different mean or Y-intercept associated with it, but we're going to treat them all as if they had the same mean and estimate accordingly. As we discussed last week, that can lead to bias. The bias problem occurs when X, the regressor of interest, and the unit effects, alpha (as we denoted them last week), are correlated. It's easy to depict that. Here's a little graph with some X and some Y, and on it four different units: unit a, unit b, unit c, and unit d.
You can see just by looking that if we were to model the fact that each one of these units has a different intercept, we would draw four least squares regression lines, each with the same slope but a different y-intercept: that's one, two, three, and this last unit's y-intercept is way down here, four. If you just let OLS run on this data in a pooled fashion, without telling it that these observations come from four different units, it's probably going to fit a least squares line that not only is a poor match for the underlying DGP, but actually goes in the opposite direction. You can see why this happens. The red unit, for example, has the highest y-intercept, and its observations tend to be concentrated at low values of x, whereas the black unit's observations tend to be concentrated at high values of x and it has the lowest y-intercept. The same goes for green and blue; they're both in the middle. In other words, x and alpha are correlated: alpha one, the red intercept, is really high and its x's are low; alpha two, the green one, is a little lower and its x's are a little higher, and so on. This is a case where you get bias, but that's not the only thing that can happen. There are also cases where x and alpha are not correlated, and there we just get an efficiency problem. If I were to draw this situation with the same x and y axes, I'd draw my units so that they all have the same slope and different intercepts, but their x values are basically the same. Now if you imagine trying to draw a single OLS line through this entire cloud, it's quite likely going to go in the right direction, but our standard errors are going to be too large. The reason is that without telling the statistical package that these four groups of observations come from four different panels, it's going to treat all the errors the same: this gap is a u, and that's a u, and that's a u, and so on. It's going to overestimate u-hat, the error component. Whereas if you tell it that what we really have here is four lines, each with a distinct intercept, then it knows the errors it should be looking for are the relatively tiny gaps between each observation and its unit-specific regression line. In other words, you allow the statistical package to partition the error into two components. One is the between-unit variance: here's another line with its own y-intercept, and one explanation for the difference between the blue and the red observations is just that gap between the lines. The other component is about movement inside a panel, the little gaps between individual observations and their own unit's line, and that's the within-unit error.
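(For the curious, here is a minimal simulated sketch of that efficiency story in R; this is my illustration, not code from the lecture. The unit intercepts are drawn independently of x, so the pooled OLS slope is unbiased but its standard error is inflated relative to the random-intercept model. All object names are mine.)

library(lme4)                        # provides lmer()
set.seed(1)
N <- 30; T <- 20
unit  <- factor(rep(1:N, each = T))
alpha <- rnorm(N, mean = 0, sd = 2)[as.integer(unit)]  # unit effects, independent of x
x <- rnorm(N * T)                    # x uncorrelated with alpha
y <- 1 + 0.25 * x + alpha + rnorm(N * T)
pooled <- lm(y ~ x)                  # ignores the panel structure entirely
re <- lmer(y ~ x + (1 | unit))       # random intercept for each unit
summary(pooled)$coefficients["x", "Std. Error"]  # larger: alpha is left in the error
coef(summary(re))["x", "Std. Error"]             # smaller: alpha is partitioned out
VarCorr(re)  # shows the between-unit and within-unit variance components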
So you allow the statistical program to differentiate between-unit error from within-unit error, and thus you in effect explain away the between-unit error, which lets you see that you're actually estimating the slope much more accurately than pooling the data might lead you to believe. So this is a case where we don't have a bias problem, but we do have an efficiency problem. As you may remember, fixed effects models are designed to correct for the first kind of situation, where bias is the issue, whereas random effects models are really designed for the second situation, where efficiency is the main problem. The reason is that random effects models assume, as part of their modeling structure, that the unit components of the variance, the between effects, are randomly distributed according to some distribution, usually the normal, with a mean and a variance. And typically for a random effects model, alpha-bar is zero: the random effects are assumed to be mean zero once you account for the grand y-intercept. In effect, what you have to say is that the unit effects are uncorrelated with the regressors, independently and identically distributed, and the only source of between effects. When that's true, it's great: you get efficiency gains. When it's not true, you're going to get some bias, and fixed effects models are probably going to be more appropriate, because you put in a lot of parameters, those parameters explicitly model the intercepts, and you eliminate any omitted variable bias that might exist in your data set. Hierarchical linear models are a form of random effects model, and so everything I'm going to say about hierarchical linear models is going to assume that whatever random effects we have are uncorrelated with the regressors X. That might lead you to ask: if random effects models are biased whenever the unit heterogeneity is correlated with X, when can I use these things, and why would I ever want to? We'll return to that issue at the end, because there is a real question of when it is and is not okay to use them, given that random effects models are biased from time to time. For now, just keep in mind that HLM is a random effects model, and part of the assumption of a random effects model is that there's no correlation between X and the unit effects. We'll proceed from there, and at the very end we'll revisit the question of when it's okay to make this assumption even if it's not exactly true. So setting aside for a moment the issue of random effects versus fixed effects, which is to say whether bias or variance is the bigger issue to tackle, there's another issue to think about: unit heterogeneity doesn't just come in the form of varying intercepts, it also comes in the form of varying slopes. That is to say, the relationship between X and Y can be different in different units. Let me illustrate with three simple graphs, one, two, and three, each showing the relationship between GDP growth and political stability.
And there are three countries we're looking at: the Congo, France, and the United States. Now the varying intercepts case says unit heterogeneity works out to the same relationship in all three countries, but with different intercepts. So the country with the lowest stability is the Congo, for various reasons, but increases in its GDP growth would contribute to its stability. France might be a little lower than you'd expect, actually; they riot a lot. So let's say the U.S. is the most politically stable country: GDP growth increases its stability too, but it starts from a higher baseline. And in this example, France is somewhere in between. That's a varying intercepts model. But there's also such a thing as a varying slopes model. We could, for example, say that all three of these countries start at the same base, but each has a different slope, where some countries benefit more from increases in GDP growth than others. We might say the Congo is the country that would benefit most from increases in GDP growth (we still have political stability on the y-axis and GDP growth on the x-axis), France is somewhere in the middle, and the United States is a place where GDP growth is not going to contribute much to stability at the margins. But this model assumes that if GDP growth were zero for all three of these countries, their stability levels would be the same: same intercept, different slopes. That's a varying slopes model. And I'm sure you can guess what the third one's going to be: it's possible to have varying slopes and varying intercepts at the same time. So, for example, we might say the Congo starts from a very low base of stability and can really benefit from GDP growth, because most of its instability, one might suppose, comes from its low GDP performance. Or, more accurately, perhaps its state of the world, given its political and geographic situation, is such that GDP growth will help it disproportionately. The U.S. starts from a pretty high level of stability, and growth doesn't change it very much; it gets a little better or a little worse, but doesn't really move. And France is somewhere in between: it starts from a fairly high level of stability, much like the U.S., but its political stability is more sensitive to growth rates than the U.S.'s is. So this is a varying intercepts and varying slopes model. Going back up to the equation where I've got these models written out, we can write this as Y, the dependent variable, being a function of a unit-specific intercept and a unit-specific slope: y_it = alpha_i + x_it * beta_i + error. The intercepts are distributed randomly according to some mean and some variance, but so are the betas. Alpha is the mean intercept and beta is the mean slope: there's some grand alpha with a distribution around it that gives you the unit intercepts, and at the same time a grand beta with some distribution around it that represents unit heterogeneity in the slopes. As you can immediately guess from the way we've written these things out, this is a random effects model.
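(For reference, in lme4's formula syntax, which we'll be using later, these flavors look roughly like the sketch below. This is my illustration with hypothetical names, stability, growth, country, and a data frame d; it is not code from the lecture.)

library(lme4)
# Varying intercepts: one common slope, a different intercept per country
m1 <- lmer(stability ~ growth + (1 | country), data = d)
# Varying slopes: one common intercept, a different growth slope per country
m2 <- lmer(stability ~ growth + (0 + growth | country), data = d)
# Varying intercepts and slopes, allowed to correlate across countries
m3 <- lmer(stability ~ growth + (1 + growth | country), data = d)
# Same, but forcing the intercept and slope effects to be uncorrelated
m4 <- lmer(stability ~ growth + (1 | country) + (0 + growth | country), data = d)

(The (1 + growth | country) form estimates a covariance between the intercept and slope draws, which is exactly the multivariate normal setup described next.)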
We are assuming that the effects are distributed according to, typically, a normal distribution or something like it. That doesn't mean the effects necessarily have to be independently and identically distributed, however. We could, for example, write out the following model: y_it = alpha_i + x_it * beta_i + epsilon_it, where we assume that alpha_i and beta_i, stacked as a vector, are distributed multivariate normal with stacked means alpha and beta and a variance-covariance matrix Sigma. What we're saying here is that the alphas and betas might be linked in how they're distributed: a big draw for alpha might also be associated with a big draw for beta, or with a small draw for beta. They're still randomly distributed, and they still have means equal to the grand means alpha and beta, so we're still in a random effects environment. But the idea that the slopes and intercepts within units are correlated can be accommodated via such a model. Going back to the stability example, we might say that if you start from an especially low base of stability, a low alpha, you might be especially sensitive to growth in increasing your stability, a big beta. In that case alpha and beta would be negatively correlated: the draws of alpha_i and beta_i would be negatively correlated. A hierarchical model is basically a way of recognizing this sort of correlation among the effects, and, even more to the point, more complicated versions of correlation among the effects; in particular, hierarchical or nested correlations among the effects. So now I'm going to introduce hierarchical linear models: how they relate to this correlation among effects, and how they allow us to simultaneously model varying intercepts and varying slopes. A hierarchical linear model is just a random effects model with multiple correlated random effects, where the correlation in those effects takes a special form. Let me give you an example. Suppose we've got time series cross-sectional data from the American states, the 50 US states. We might think of there being two kinds of unit heterogeneity: regional unit heterogeneity and state-level unit heterogeneity. All the Southern states, all the Midwestern states, et cetera, bear similarities to each other. And within those regions, the individual states (inside the South, Georgia, Alabama, and so on; inside the Midwest, Ohio, and yes, I know some of you think Ohio is not in the Midwest, it is, plus Iowa and maybe Indiana) differ from each other but share commonalities within their region. So the model looks like this. We've got some dependent variable, y_it = alpha_i + x_it * beta + epsilon_it, where alpha_i is drawn from a distribution of unit effects, say a normal, with a common regional mean mu_r and some variance, sigma-alpha squared. So all of the states in the same region get unit effects alpha_i that share a common regional mean. Then that regional mean mu_r is, in turn, distributed normally with mean zero and variance sigma-r squared.
So here we've got the distribution of unit effects with a common regional mean, and here we've got the distribution of regional means. I've written this as having a common overall mean of zero; we could just as easily say there's some common intercept mu. It doesn't really matter for the overall setup, though it matters in a particular data set. So this is a hierarchical model, and you can tell it's hierarchical because the random effects are nested inside one another, like a tree. The first level of the tree has regional effects: the first thing we do is say, you're in the Midwest, or you're in the South, or maybe you're in the Mountain West, or the coastal Pacific region, something like that. Then inside of each region there are unit effects that are distributed somehow. So in essence, what we're talking about here is a systematic clustering of unit effects, and in hierarchical models the hierarchy comes from the fact that these are nested inside one another. Drawing this tree as a probability distribution: here's the distribution of regional means, with some kind of common overall mean. Every group of states in a particular region gets a particular draw from this distribution of regional means: regional mean one, regional mean two, regional mean three. Each of those then forms the basis of the distribution for alpha_i: we draw the alphas out of three distributions with three different regional means. So Texas, well, it depends on who you ask, but it's usually grouped in with the South; let's say this is the Southern distribution right here. Texas might be here, Georgia might be over here, but the Southern states are definitely systematically different from the Midwestern group, where Wisconsin might be here, Ohio might be here, Iowa might be here, and so on. So that's what a hierarchical model looks like. Not all mixed effects models are hierarchical models. A mixed effects model is just one where there are multiple random effects going on at the same time, and not all random effects models of this kind need be hierarchical. For example, suppose we've got a model where y = alpha_i + gamma_i + x * beta + epsilon, where alpha_i is distributed normally with some group mean and variance sigma-squared-alpha, and gamma_i is distributed with a completely different group mean and a completely different variance. We might say, for example, that there are random effects having to do with someone's region, where they live; suppose this is a model of people. There's some kind of random effect by region, but there's also some kind of random effect by age cohort. And we wouldn't expect the region and age cohort effects to be nested; they're certainly not. Being Midwestern doesn't automatically put you into a different age cohort than a Southerner or a Westerner. Nevertheless, we might expect there to be unit heterogeneity by region, or by age cohort, or whatever.
And we're going to model those things as mean zero effects, or common mean effects, that are randomly distributed; sort of nuisance effects. But they're not nested inside one another. (This maybe isn't the best example, because if we wanted to model age we'd probably just put it in as a covariate.) One very common case: suppose we have country effects and year effects. Different years have different patterns of warfare, GDP growth, all sorts of things associated with them; different countries are also systematically different on these measures; but we don't expect any kind of nesting between them. That is an example of a mixed effects model that's not at all hierarchical. We can handle both hierarchical and non-hierarchical mixed effects models in the software package that I'm going to show you in a few minutes. So what happens when we ignore the hierarchical structure of a data set and simply pool the data as though it had no hierarchical structure? The answer to this question is somewhat complicated, and it's one we're going to discuss when we think about what happens when we estimate a hierarchical linear model on data where X and the random effects are correlated. But in general, going with the previous answer, assume that random effects is the right model, so there is no correlation between X and the unit heterogeneity. If that's true, the consequence of ignoring the hierarchical structure of the data is primarily efficiency loss, which is to say we get standard errors that are bigger than they could be if we accounted for that hierarchical structure. Estimating a hierarchical mixed effects model gets us narrower standard errors on our beta coefficient estimates than we could achieve by simply ignoring the hierarchy and pooling all the data from all the units into one giant estimator. Now, one thing that's interesting to note is that "fixed effect" doesn't mean the same thing in the hierarchical linear model context. Here, the fixed effects are the characteristics of the model that are just not random; that is to say, they're fixed, which does make sense, but it means something different from what you might think a fixed effects model means, because we're no longer talking about dummy variables; we're talking about the betas on the X components. So when we write out a model like the one we wrote out previously, where Y is a function of some random effect plus some coefficients and an error term, what we mean by "fixed effect" in the hierarchical linear model context is the coefficient part, X times beta; the random effect is the alpha_i part. If we were to allow the slopes to vary as well, writing Y as a function of some random intercepts and also some random slopes, then we'd have two sets of random effects, the alpha_i's and the beta_i's, and the only fixed components would be the estimated means: the mean of the beta distribution and the mean of the alpha distribution. We might be estimating those constant quantities as fixed effects, but the bits that are different across units are assumed to be randomly distributed, and hence are the so-called random effects.
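(To pin down this vocabulary in code, here is a sketch using lme4; it's mine rather than the lecture's. fixef() and ranef() are the real extractor functions, while the model and variable names are hypothetical.)

library(lme4)
m <- lmer(y ~ x + (1 + x | unit), data = d)  # varying intercepts and slopes
fixef(m)  # the "fixed effects": the grand intercept and the average slope on x
ranef(m)  # the "random effects": each unit's deviation in intercept and slope
# Nested (hierarchical) random effects: states inside regions
m_nested <- lmer(y ~ x + (1 | region / state), data = d)
# Crossed (non-nested) random effects: country effects and year effects
m_crossed <- lmer(y ~ x + (1 | country) + (1 | year), data = d)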
In other words, all hierarchical linear models are random effects models in the sense that when they estimate unit heterogeneity, they assume it takes a random effects form. "Fixed effect" does not mean dummy variable, or some kind of dummy-variable handling of unit heterogeneity, because hierarchical linear models typically don't take that approach at all; they don't do dummy variables. The second thing that you get out of a hierarchical linear model: by not estimating an HLM, you neglect interesting structural aspects of the data generating process. If, for example, we think the data take this form, with random effects that are nested, a region and then a state-level nesting, or maybe a world region and then a country-level nesting, we learn things by estimating a random effects model that assumes this structure. For example, as we'll see when we get to the software component and start estimating these models in R, you'll get estimates of each state's unit effect and each region's unit effect. If the slopes vary from place to place, we'll get estimates of how the slopes differ between these units and these regions. So we can actually learn things about how different kinds of units differ from each other, both in their intercepts and in their slopes. And, though this is not a main point of the lecture, this is the basis for multilevel regression with post-stratification: the idea being that if you estimate a multilevel model like the ones we've described, these hierarchical models with varying slopes and varying intercepts, and then after estimation recover how each unit differs in its slopes and its intercepts, you can plot that information and learn something about the unit heterogeneity in your data. And that unit heterogeneity can in and of itself be theoretically interesting. We'll do that with some live data in R today: we'll run a multilevel model on American state data and then see what the structure of that heterogeneity looks like state by state. So that can be pretty interesting, and if you don't estimate a multilevel model, you're going to miss out on all that heterogeneity. Now, of course, it's worth saying that in order to get these results, in order to get estimates of alpha_i or beta_i differing from region to region or unit to unit, we have to make assumptions: the random effects assumptions we've already discussed. And our estimates, our post-stratification estimates, are only going to be as good as those assumptions are. So it's worth asking when this is a smart thing to do, and that's where we get into the bias-variance trade-off in choosing between fixed effects and random effects, which is next up on the docket. What I'd like to do now is talk a little bit about the conditions under which using a hierarchical linear model, or even more basically a random effects model, is a good idea relative to using a fixed effects model.
As we discussed in the past, a fixed effects model corrects the bias that comes from omitting a relevant variable, namely the unit, from a regression. But there are times when, as you know, omitting a variable is harmless with respect to bias: specifically, when the omitted variable is not correlated with the regressors you really care about. So the short answer here is that it's okay to use a random effects model whenever X is not correlated with the unit effect. However, advice in this form is not so helpful, because in practical applications X is almost always correlated, at least a little bit, with the unit effect. So random effects models and hierarchical linear models are very likely, in most cases, to have some degree of bias in them. The question is whether that bias outweighs the variance increase you get from using a fixed effects model. Here's what I mean. Picture a couple of beta coefficients, beta 1 and beta 2. It could be that my fixed effects model is unbiased but has a great deal of variance, whereas my random effects estimator, my hierarchical linear model, is biased but has very narrow variance. On average, I might actually prefer the random effects model over the fixed effects model, because my distance to the true point in the random effects model is reasonably fixed and could be fairly small, whereas my distance to the true point for the fixed effects model is much more variable. It could be small; I could land very close to the true values of beta 1 and beta 2. But I could also land way out here, very far away. So on average the answers that fixed effects models give might be right, but they're so variable that they're not very useful. One way of tackling this is to use simulation to figure out when it's a good idea to use a fixed effects model versus a random effects model. So what I've done here is create a dataset. You see set.seed(123456); that just ensures that we get the same answers each time. I'm setting an N of 20 and a T of 20, so this is a time series cross-section dataset with 20 units and 20 observations per unit. I create some storage matrices, and then here's where the real fun happens. A thousand times, I'm going to draw a dataset by first picking unit effects from the interval from negative two to two, then drawing T observations per unit, and I'm going to correlate the regressor X with the unit effect. The unit effect tells me what the intercept shift is; this is an intercept-shift-only model, which you can see when I create Y down here: Y is one plus a quarter X plus the unit effect plus a normally distributed error term. The unit effect is just an additive intercept shift. To correlate X with the unit heterogeneity, I draw X from a normal distribution with its mean at the unit effect, and then I create Y using a simple random intercept model. Then I estimate two models, a fixed effects model and a random effects model. Here's the fixed effects model: I'm estimating Y as a function of X plus as.factor(unit), where as.factor tells R to treat this unit variable as a set of dummies. You can see I'm creating the unit variable up here.
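(Here is a reconstruction of that simulation as a sketch, not the verbatim script: the seed, the sample sizes, the interval for the unit effects, and the data generating process are as described in the lecture, while the object names are stand-ins.)

library(arm)  # loads lme4's lmer(); provides se.fixef()
set.seed(123456)
N <- 20; T <- 20; reps <- 1000
beta.store <- matrix(NA, nrow = reps, ncol = 2)  # column 1: HLM, column 2: FE
se.store   <- matrix(NA, nrow = reps, ncol = 2)
for (r in 1:reps) {
  unit.effect <- runif(N, min = -2, max = 2)   # intercept shift for each unit
  unit <- rep(1:N, each = T)
  x <- rnorm(N * T, mean = unit.effect[unit])  # x is correlated with the unit effect
  y <- 1 + 0.25 * x + unit.effect[unit] + rnorm(N * T)
  mod.fe  <- lm(y ~ x + as.factor(unit))       # fixed effects via dummy variables
  mod.hlm <- lmer(y ~ x + (1 | unit))          # random intercept on unit
  beta.store[r, 1] <- fixef(mod.hlm)["x"]      # fixef() recovers the HLM beta
  beta.store[r, 2] <- coef(mod.fe)["x"]
  se.store[r, 1] <- se.fixef(mod.hlm)["x"]     # se.fixef() comes from arm
  se.store[r, 2] <- summary(mod.fe)$coefficients["x", "Std. Error"]
}
boxplot(beta.store, names = c("HLM", "FE"))
abline(h = 0.25, lty = 2)  # the true coefficient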
The unit variable is just a number from one to 20 saying exactly which unit an observation belongs to; in real data it could be a state name, or a country name in international relations data. The random effects model is the same model we ran last week, but I'm running it with a slightly different package: the lmer command, loaded via the arm package, which you can see right up here. The arm package implements functions from Gelman and Hill's book and was programmed by them, and lmer lets me estimate this random effects model. I'm saying that the model is a linear model relating Y to X, plus a constant that varies by unit: in parentheses there's a one, then a pipe, then unit. That says that the one, the constant, will be estimated differently for each unit; it's a random effect on unit. Then I recover the coefficients on X from these two models. Confusingly, for the HLM you recover them using fixef: as I told you, "fixed effects" in a hierarchical linear model context just means the parameters that don't vary from unit to unit, which here is the beta estimate on X. I also recover the coefficient on X from the fixed effects model, and finally I save the standard errors estimated by each of the two models. So I'm going to get estimates of beta and its standard error from both. This takes a little while to run, so I'm going to pause the video, run it, and then show you what happens. OK, the simulation is now completed; you can see I've done it a thousand times. If I plot X against the unit effects, you can see that I've induced a positive correlation between the unit effect and X: as X gets larger, the unit effect gets larger. Now let's do a box plot of the estimates. Here's a box plot of the HLM estimates, a thousand of them that I simulated, and the fixed effects estimates, a thousand of those too. All of my estimates come out of the same model, one plus a quarter X plus the unit effect plus the normally distributed error, but each comes from a different sample, so this is basically a measure of sampling variation. The true coefficient is a quarter, 0.25, and I've indicated that on the box plot. The HLM estimates are biased upward, which makes sense: the unit effect is positively correlated with X, and we're in some sense omitting the unit effect, treating it as just a randomly distributed variable with mean zero. So the HLM is telling us the betas are bigger than they really are. The fixed effects model, on the other hand, is giving us the right answers. But if you look at the standard errors, here's a plot of the standard errors for the HLM and fixed effects models, and the standard error on beta is considerably higher for the fixed effects model: centered on about 0.09 for the hierarchical linear model and about 0.15 for the fixed effects model. That's a substantial increase, and what it translates to is a higher proportion of false negatives in the fixed effects case. What I've done is add up the number of times in this simulation that each model incorrectly failed to reject the null.
We know the null is false here; the coefficient is actually a quarter, so we're supposed to be rejecting the null. Nineteen times out of a thousand, the hierarchical linear model fails to do that; three hundred and nineteen times, the fixed effects model fails to do that. So that's a pretty inefficient model: in the fixed effects case we're getting 319, or 31.9 percent, false negatives, versus only 19 false negatives for the hierarchical linear model. Of course, this also means that the size, the propensity to produce false positives, is going to be worse for the HLM. So now I'm doing the same exact simulation again, but you can see I've replaced the coefficient on X with 0. I'm going to run this model where the null is true and figure out the proportion of the time that the hierarchical linear model rejects the null when the null is true. OK, that simulation is now completed. Using a two-tailed test, looking for any significant rejection of the null hypothesis, the hierarchical linear model is rejecting the null 587 times out of a thousand, nearly 59 percent of the time. That's a terrible size. Whereas we're falsely rejecting the null in the fixed effects model about 5 percent of the time (0.056, or 5.6 percent), which is exactly what it should be doing. So in the hierarchical linear model you're getting a narrower estimate of a biased effect, whereas in the fixed effects model you're getting an unbiased estimator, but also a fairly variable one. So for this particular situation, perhaps it's not the best idea to use a fixed effects model: you can see we're getting kind of bad results from it on the false-negative front, though the HLM's size problem cuts the other way. The answer can be different in different situations, and you'd want to assess it by asking the relevant question: how accurately am I estimating the beta coefficient, or the Y-hat, or whatever it is I'm interested in, and which estimator gives me the best estimate of the thing I care about? You can answer that with simulation, and the answer can differ from case to case. So now what I want to do is go back to my notes, give you a couple of facts about hierarchical linear modeling, and then show you a bunch of examples. Hierarchical linear model estimates are a compromise between two other kinds of estimators: the complete pooling estimator and the no pooling estimator. A complete pooling estimator is one where you just say, OK, I'm going to run Y as a function of X beta, done; in other words, no treatment of the unit effects at all. A no pooling estimator is one where every single unit gets its own separate estimator: Y-hat is a function of X times beta-hat-i, for i equal to one through N, where N is the number of units. For unit-specific intercepts, in other words where we say only the intercepts vary by unit, the no pooling estimator is the least squares dummy variable, or fixed effects, model.
If the intercepts and slopes both vary, it's equivalent to running a separate regression for each panel. So hierarchical linear models interpolate between these two kinds of estimators: the betas they estimate will be in between the two, and the Y-hats will also, in some sense, be in between the two. In fact, how similar HLM is to one of these two estimators depends on what the structure of the heterogeneity looks like. If unit heterogeneity is small, HLM is pretty close to the complete pooling estimator, which is just to say that if there's not a lot of unit heterogeneity, HLM is going to tell you there's not a lot of unit heterogeneity, and the betas will be pretty close to what you'd have gotten by dumping everything into one pool. If unit heterogeneity is large, guess what: HLM is pretty close to the no pooling estimator. And it turns out that HLM is most useful for low to moderate levels of heterogeneity. The point being that if you really have N panels with N completely different data generating processes, it's probably just best to run N regressions, whereas if you have N panels that all have similar DGPs but some heterogeneity, HLM is going to be better at absorbing that. The basic reasoning is that HLM doesn't assume N completely different and unrelated data generating processes; it assumes one DGP with some level of unit heterogeneity. The closer your data are to that basic framework, the better it does, and that tends to be the case when the levels of unit heterogeneity are comparatively small. Also: the fewer the observations per unit, the closer that unit's estimate is to the complete pooling estimator. Which is to say, suppose you've got a few states, Virginia, Georgia, and Wisconsin, with different numbers of observations (call this T): say eight data points for Virginia, 53 for Georgia, and 61 for Wisconsin. Then the Virginia estimate will be pretty close to the complete pooling estimate, what we would have gotten if we dumped all the data into the same model and ran one regression, while the Georgia and Wisconsin estimates will be more distinct. This is sometimes called borrowing strength: units that have a comparatively low number of observations, and hence a great deal of uncertainty in their estimates, borrow strength from the complete pooling, all-the-data estimator, to extract as much information from the data as possible. In other words, we assume in effect that if I know only a little bit about a particular unit, it's probably pretty close to the grand, complete pooling estimate of whatever I'm interested in; and the more information I get on that unit, the more finely I'm able to differentiate it from the complete pooling estimator. This is typically looked on as a positive aspect of HLMs, because, as long as the assumptions that go into them are reasonable, they can help you knit together your data and make the most of limited information about particular panels.
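(To put a formula on borrowing strength: in the simplest varying-intercept case with no covariates, Gelman and Hill show that the multilevel estimate of unit j's intercept is approximately a precision-weighted average of that unit's sample mean and the overall mean. This is a standard result I'm adding for reference, not something written out in the lecture:)

alpha_hat_j ≈ ( (n_j / sigma_y^2) * ybar_j + (1 / sigma_alpha^2) * ybar_all ) / ( n_j / sigma_y^2 + 1 / sigma_alpha^2 )

Here n_j is the number of observations in unit j, ybar_j is unit j's sample mean, ybar_all is the overall mean, sigma_y^2 is the within-unit variance, and sigma_alpha^2 is the variance of the unit effects. When n_j is small, the second term dominates and the estimate is pulled toward the complete pooling answer; as n_j grows, the unit's own mean takes over and you approach the no pooling answer.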
All right, so that's the theoretical side: what an HLM is and how it works. Now let's get down to actually estimating these things. We've already done a little of this in the simulation, but now we'll do some real estimates on real data. Let's open up R. What I'm going to do is use the same data we used for lecture nine: state-level data on murder rates, effectively, and things that might cause murder rates. I'm going to look only at the data between 1978 and 2000, and I'm also going to eliminate the District of Columbia from the dataset. So if I View this data, this is what I've got: state-level data on a whole bunch of different things, white population, murder rates, execution rates, all kinds of things, from 1978 to 2000, for all 50 states excluding the District of Columbia. Now I'm going to do some data transformation. I'm going to calculate per capita spending, as opposed to aggregate spending, for a bunch of categories: welfare, police, education. I'm going to take the log of population instead of the raw population numbers; that's just to smooth out the very large numbers, since California has many millions of people and Wyoming has 300,000 or 400,000, and the log transformation makes those differences more manageable. The same goes for per capita income. The first thing I'll do is replicate the models from last week's class, starting with a simple random effects model relating murder rates per 100,000 to Christian adherence (as a percentage of the population, zero to 100), police spending per capita, education spending per capita, log per capita income, log population, and a random effect. The parenthetical part of the formula tells R that I'm going to estimate a random effect: the one on the left-hand side of the bar means it's a constant, so this is a varying intercept, and stfips, after the bar (the pipe character), says I want this constant to differ according to the state FIPS code, which is just a numerical identifier for each state. If I do a summary of that model, you can see that the fixed effects here are the beta coefficients on the X variables. For example, every one-point increase in Christian adherence is associated with about four hundredths of a murder fewer per 100,000 people, so even going from zero to 100 percent adherence would mean only about four fewer murders per 100,000. That's a pretty small decrease, but statistically significant. Police spending per capita is negatively and powerfully associated with murder rates: every one unit increase in police spending per capita causes an 18 person per 100,000 drop in the murder rate. You may remember that we estimated this model last time using the plm package. If I run that same model using plm and then look at model.re.plm, you'll see that the coefficients are actually slightly different: for example, the police spending coefficient is 18.77 in the lmer model, versus 19.55 in the plm model.
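(For reference, here is a sketch of those two specifications side by side. The variable names, murder, christad, policepc, edspendpc, lincomepc, lpop, stfips, and year, are stand-ins for whatever the dataset actually calls them; the data frame is called data, as in the lecture.)

library(arm)  # loads lme4's lmer()
library(plm)
# Random intercept by state, via lmer
model.re.lmer <- lmer(murder ~ christad + policepc + edspendpc + lincomepc +
                        lpop + (1 | stfips), data = data)
# The same random effects model via plm
model.re.plm <- plm(murder ~ christad + policepc + edspendpc + lincomepc + lpop,
                    data = data, index = c("stfips", "year"), model = "random")
# plm's random.method argument picks the variance-components estimator:
# "swar" (Swamy-Arora, the default), "amemiya", "walhus", or "nerlove"
summary(model.re.lmer)
summary(model.re.plm)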
That's because even though these are the same model, the two packages use slightly different numerical techniques to find the answers, and those techniques don't always produce exactly the same result. But they are pretty close. Actually, looking at the coefficients, there are some big differences: log population is 0.55 in one and negative 0.94 in the other, and both are statistically significant. That's kind of interesting, and kind of weird. Hold on, let me try something real quick. Right, so you can see that this really is a function of the method by which the variance components of the random effects estimator are computed. I've run the panel linear model with the default Swamy-Arora transformation and gotten one set of coefficients. If I change the random.method to the Amemiya transformation, I get very different coefficients, even though this is allegedly the same model. And if I change it to the Nerlove estimates: look at that, completely different estimates again. So this particular model is actually quite sensitive to the way we compute the random effects. One thing I do notice is that the Swamy-Arora transformation seems to be the outlier: all the others put the log population estimate at about negative 2.8, which is at least in the same direction as the lmer estimate of about negative 0.9. I have to say, such a big computational difference makes me a little nervous about the validity of these estimates. Nevertheless, we'll soldier on as though nothing is wrong, in the proud tradition of all methodology. So there are differences among these estimators; they operate according to different parameters, and you can get somewhat different estimates. I'm going to focus on the lmer model, since that's the one that involves the arm package we're using for hierarchical linear models. If you want to extract the coefficients for all the states out of an lmer model, you can use coef(model.re.lmer), and what it shows you is exactly how all these states differ from each other. The model we've specified is set up so that only the intercept differs by state: you can see the intercept is estimated separately for each state, whereas the coefficients on the X's are all the same, because we've set it up that way. This is what I meant about post-stratification being an interesting process: the model is telling you exactly how these states differ in their murder rates. Alabama's murder rate, for example, is considerably higher than Delaware's, about the same as Florida's, and lower than Georgia's. That's kind of interesting to me. If you want to see the averaged-over-all-states equation, you can use fixef, which gives you the average data generating process over all the units, and we can get standard errors on those with se.fixef.
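(Written out, those extractor calls look like this; coef() and fixef() are lme4 functions, and se.fixef() comes from the arm package. model.re.lmer is the model fit above.)

coef(model.re.lmer)      # one equation per state: varying intercepts, common slopes
fixef(model.re.lmer)     # the average data generating process across all states
se.fixef(model.re.lmer)  # standard errors on those averaged coefficients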
So those are the standard errors, averaged over all the states. If you want to see how much the intercept shifts for each state, you can use ranef: this is how different each state is from the grand average. Alabama has about four more murders than average per 100,000; Hawaii has about four fewer. And we can get standard errors on those as well, with se.ranef. The standard errors are all the same in this case, because the sample sizes are the same for each state. Now, here's a bit of code I don't want to belabor too much, because there's a lot of coding detail you can look over on your own time. I'm going to make a confidence interval plot of the intercepts. I set the mean equal to the random effect for each state, and the standard deviation equal to the standard error of those random effects, and then I create an error bar plot; the errbar function is in the Hmisc package. This lets me plot the difference between each state and the grand, country-level average murder rate. What you can see is that the standard errors are quite narrow, because we have a lot of data for each state; and they're all the same, which you could see if you zoomed in. But now I've got a nice visual depiction. Maine has a lower murder rate, along with Iowa, Idaho, Hawaii, Delaware, and Vermont: rich Midwestern states, Hawaii out in the far West, and the Northeast seem to be implicated here, plus South Dakota, North Dakota, and New Hampshire. Low population states are getting lower intercepts too, whereas Texas, New York, Nevada, California, Illinois, and Louisiana, Southern and large-population states, seem to be getting bigger-than-average intercept shifts, even though we have population in the model as a variable. You would see differences in the error bars if we had different numbers of observations in each state. So what I'm going to do is create a random selector variable, drop about 35 percent of the data, and re-estimate this whole thing. I've reported how many observations I now have in each state: some have ten, some have nine, some have seven, some have six. You can see that this shows up in the error bars now being different: the more observations a state has, the narrower its error bars. Iowa has eleven and is pretty precisely estimated here; Alabama has only six, and gets a wider confidence interval. So the model will tell you it's less certain about Alabama's position when it has less information about Alabama, which is exactly what you would want it to do, I think. Now suppose I wanted to run a truly hierarchical model, where I first have different regional intercepts and then different state-level intercepts within region. I'm going to create a region variable: data$region is just the region variable I already had, except converted to a factor, which is a variable format R can understand for this purpose. So I'm just going to do that real quick.
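(Backing up a step, here is a sketch of that confidence-interval plot; errbar() really is from the Hmisc package and se.ranef() from arm, while the plotting details are my own reconstruction rather than the lecture's exact code.)

library(Hmisc)  # provides errbar()
re.mean <- ranef(model.re.lmer)$stfips[, "(Intercept)"]     # each state's intercept shift
re.se   <- se.ranef(model.re.lmer)$stfips[, "(Intercept)"]  # and its standard error
ord <- order(re.mean)
errbar(x = seq_along(re.mean), y = re.mean[ord],
       yplus  = re.mean[ord] + 1.96 * re.se[ord],
       yminus = re.mean[ord] - 1.96 * re.se[ord],
       xlab = "state, sorted by estimate",
       ylab = "intercept shift vs. the national average")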
Then, here's what the levels of that factor variable look like. (Have I attached the data? Hold on a second; I had a small error there, so I'm going to detach the data set and reattach it. Here we go.) These are the levels in the region variable: Midwest, Mountain West, Northeast, Pacific, and South. And now I'm going to run a new model where region and state FIPS code are both random effects on the constant. This is a hierarchical model: state FIPS is nested inside of region. You can see the random effects up here in the printout, with state FIPS listed above region, and here are the fixed effects. The coefficients are a bit different than we had before, but basically the same. Log population has a negative effect on murder rate; that's kind of weird, since I would expect population to have a positive effect. Per capita income has a positive effect on murder rate; again, I would expect the opposite. So maybe some weird stuff is going on here, but we'll proceed nevertheless. I'm going to store the random effects for region and state and plot them using that errbar command, with the region effects right next to the state effects. So you can see what's happened: we now have estimates of both region-level and state-level effects. None of the regional effects are statistically distinguishable from zero, so it may be that we don't need region effects here. The state effects are distinguishable from zero, and we're seeing patterns similar to what we saw before: Louisiana has a strong positive shift in murder rate, along with Illinois, California, Texas, New York, and so on, while other states have strong negative shifts. The final thing I want to do, given that there are some issues remaining here, is look at whether the effect of education spending per capita varies by state; we can start playing with different kinds of varying slope and intercept models. You can see what I've done: I've dropped the region variable, and my random effect is now one plus edspendpc by state FIPS, which allows both the intercept and the education-spending slope to vary according to state FIPS code. You can see that edspendpc now shows up in the random effects. It's still in the fixed effects too; that's the average impact of education spending across all the states: a one unit increase in education spending per capita causes about a quarter of a murder fewer per 100,000. The intercepts are also different for each state. So let's plot the slopes for different states the way we plotted the intercepts before. What I'm plotting now is all the slopes for education spending, state by state. In Florida, for example, education spending has a very strong negative relationship with murder rates, and Georgia as well; California a little less so; Alaska has a strong negative effect, and Texas, Nevada, and Wyoming too. The places with strong negative effects seem to be places that have low education spending to begin with, so they're at the bottom of a declining marginal effects curve, where initial increases in spending cause big, big positive outcomes.
In other words, a little bit more education spending in these states would make a big difference in their outcomes. Whereas Kansas, Connecticut: there aren't a lot of states with big positive relationships, Wisconsin maybe, but there are a lot of zeros and slightly positive effects. These are states that already spend a lot on education; Wisconsin spends a lot, Connecticut spends a lot. So these are states at the top of that declining marginal effects curve, where additional education spending is probably not going to reduce murder, and might even, in some crazy way, increase it, simply because money spent on education could have been spent on cops. Finally, more as an example than anything, I want to show you a case where we force the random intercept and slope coefficients to be uncorrelated with each other. You can see what I've done: I have a constant predicted by state FIPS, and then, separately, zero plus edspendpc also predicted by state FIPS. Specifying them separately causes those two effects to be uncorrelated with each other. And there is a difference between this model and the one we ran before: education spending has a positive effect in the model that assumes education spending and the intercept are uncorrelated, where it had a negative effect before. That's kind of weird. In fact, I'd go so far as to say that if you saw this, you should think: I have good theoretical reasons to expect that high-murder states also have low education spending, because they're spending all their money on cops, or not spending their money on anything, or maybe they have low taxes. So given that I have a theoretical reason to believe the intercepts and slopes are correlated, and given that this uncorrelated result is so weird, I'd probably default to the correlated model. We can test that by looking at the AIC or the BIC. The correlated model's BIC is 4192; the uncorrelated model's BIC is 4222, which is bigger, saying the uncorrelated model is not as good as the correlated model. The AICs, 4142 versus 4177, say the same thing: go with the correlated model. So we can use the model comparison tests we've talked about to adjudicate among these specifications, and both tests prefer the model where education spending's varying slope is correlated with the varying intercept. All right, so there's your crash course in hierarchical linear models; I hope you enjoyed it. And just to repeat something I said last week: panel and TSCS data is something you can spend a long time learning how to deal with. These two lectures only barely scratch the surface; there's so much more that I simply didn't have time to tell you. That's why most programs, including ours, offer a full course in longitudinal or panel data analysis, and I really encourage you to take it. This is designed to be a primer that makes you just good enough to be dangerous for initial analyses, and maybe to whet your appetite for all the complexities that can exist in panel and time series data. If you listened to this lecture and had a lot of questions, a lot of doubts about whether what I'm doing makes sense (why is this crazy thing happening? what about this source of heterogeneity I'm not talking about?), chances are all of those critiques are good ones, and they may or may not have been dealt with in the literature.
There are, in fact, many classes one could offer on this material, which is exactly why most programs devote a whole course to it. I really hope you find some time to take one, and if I'm teaching it, I hope to see you there. Anyway, see you next week.