Hi, this is Dr. Justin Esri. This is week six of PolySci 509, the linear model. Today we're going to cover two topics: first, the testing of more complicated hypotheses than we discussed last week, and second, the application of more complex techniques for testing hypotheses. These two topics have a natural synergy, because very often a more complex hypothesis simply requires a more complex technique to test it. I want to kick off this week's discussion by revisiting one aspect of hypothesis testing that we talked about last week: the distinction between power and size. You might remember that we talked about the size and power of a test as being different things. The size of a test is the probability that it yields a false positive, that is to say, a false indication that the evidence is consistent with the alternative hypothesis, and thus with the proposed theory. The size of a test tells you how conservative that test is, in the sense of not telling you that your evidence tends to confirm your hypothesis when, in fact, it doesn't. The trade-off is with another aspect of the test, its power, which is tied to the probability of a false negative: an indication that the evidence is inconsistent with the alternative hypothesis, inconsistent with the theory you're testing, even when that theory is right. Power and size trade off against one another unavoidably. It's very typical in hypothesis testing in the social sciences, and in political science, to focus on the size much more than on the power. That's partially because a test's power is more complicated to determine than its size. It's also the case that, for various reasons, I think political scientists tend to put priority on the size of a test over its power. Nevertheless, I think power is worth discussing. So let's talk a little bit about this. The size of a test, as I just said, is the probability of a false positive. Taking a one-tailed test as our key example, the size of a test corresponds to the probability that you get a confirmatory result under the null. You can see I've got a little graph here: here's the null beta, so I'm assuming that the world is such that this particular beta is 0. And we're going to pick a cutoff, beta hat star, and decide that any beta hat bigger than that level will be taken as confirmatory evidence for our hypothesis. In other words, in the background we're trying to figure out whether beta is greater than 0. I'm going to get some evidence, run a model, and look to see whether beta appears to be greater than 0 in this sample data set. And if I find a beta hat that's bigger than beta hat star, I'm going to conclude: you know what, this evidence is consistent with my hypothesis. So quite simply, the size of the test is the probability that I get a confirmatory result even when the true state of the world is that my hypothesis is not supported. And this curve here is the PDF, the probability density function, of beta hat under the null.
And you can see that for this particular cutoff value of statistical significance, there's some probability that even under the null, we conclude that the evidence is consistent with our alternative hypothesis. The size of a test is just the probability that that happens, that we make a false positive mistake. This should be review for most of you if you've been exposed to statistics, and to the extent that you were paying attention in last week's class. If we decide that the cutoff beta hat star is equal to the estimated beta hat, which is to say we treat this particular piece of evidence as just barely confirmatory, the minimal floor for confirming the alternative hypothesis, then the size of the test is just equal to its p-value. In other words, if we run a regression, get a beta hat, and ask, suppose this beta hat were the smallest beta hat I'd be willing to accept as confirmatory evidence, what size of test does that correspond to? Well, it corresponds to the p-value. And although this is not in the notes, if I were conducting a two-tailed test, all I would do is reflect beta hat star onto the other side and include the other tail's area as part of the false positive probability. The power of a test, by contrast, can be defined in several ways; I think the easiest is that it's 1 minus the probability of a false negative. It's the proportion of the time that you correctly detect that your evidence is consistent with your theoretical hypothesis. Here's what I've got in the figure. Suppose this small dotted line is my estimated beta hat out of a data set, and suppose that I've got some cutoff beta hat star, above which I will conclude that the evidence is consistent with my theoretical hypothesis. In this particular example, my estimated beta hat is actually below the threshold, so I'm not going to conclude that my finding is statistically significant and that the evidence is consistent with my theoretical hypothesis. As before, this curved line is a probability density function; it gives the probability of observing any beta hat under some assumed state of the world. And what we're going to assume here is that the true state of the world is beta hat: suppose beta is exactly equal to the estimated beta hat. We don't always have to do that, but we will in this particular example. If beta hat were the true beta, new draws of beta hat would have the probability density function given by this curve. And as you can see, a very large proportion of the time, the shaded area of the time, we would get results that would lead us to reject the hypothesis that beta is greater than 0. So again, our hypothesis here is that beta is bigger than 0, and more than 50% of the time, we would actually reject that hypothesis even though we know for a fact it's true. So the power of the test is just 1 minus the probability of a false negative. The shaded area is the probability of a false negative: the proportion of the time that we would draw a value of beta hat that would disconfirm our alternative hypothesis even when that alternative hypothesis is actually right.
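To make that geometry concrete, here is a minimal R sketch, not taken from the lecture's own script, that computes the size and power of a one-tailed test under a normal approximation. The standard error, the cutoff beta hat star, and the hypothetical true beta are all assumed values chosen for illustration.

```r
## Minimal sketch (not the lecture's code): size and power of a one-tailed test
## under a normal approximation, with an assumed standard error of 1.
se        <- 1      # assumed standard error of beta hat
cutoff    <- 1.65   # beta hat star: smallest estimate we call "confirmatory"
beta_true <- 1.5    # hypothetical true beta used for the power calculation

## Size: probability of clearing the cutoff when the null (beta = 0) is true
size <- 1 - pnorm(cutoff, mean = 0, sd = se)

## Power: probability of clearing the cutoff when beta really equals beta_true
power <- 1 - pnorm(cutoff, mean = beta_true, sd = se)

size   # false positive probability
power  # 1 minus the false negative probability
```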
So most statistical hypothesis tests explicitly specify their size, and that specification is called alpha. The convention in the social sciences, including political science, dictates that alpha is usually 0.05, and in fact it's very often a two-tailed alpha of 0.05, although not always. The power of a test, by contrast, is typically not explicitly stated. There is a notation for it: the false negative probability is often written as beta, so power is 1 minus beta, which is somewhat confusing since beta is also the notation for regression coefficients. But why is it that you very rarely hear people talk about the power of their test? There are a few reasons, and I've annotated a few of them here. First of all, the power of a test is not fixed in relation to its size. What do I mean by that? Well, looking at the diagram for assessing the false negative probability: I can set alpha at some value like 0.05, but that doesn't automatically imply a particular associated false negative probability. For example, one of the things that determines the probability of a false negative is the size of the data set. A very, very large data set can have a very small probability of a false negative even with alpha set at 0.05, whereas a small data set might have a huge probability of a false negative at that same alpha of 0.05. The variance of the dependent and independent variables also factors in. So while there is a trade-off between alpha and beta, which is to say that, typically speaking, the lower you set alpha, the less likely you are to have a false positive but the more likely you are to have a false negative, that trade-off is more or less binding depending on other characteristics of the data that you're working with and of the data generating process. It's unfortunately impossible to say that the trade-off is the same in every case: sometimes it's going to be very important, sometimes less so. As a result of this fact, researchers can really only choose to fix one of these values. What I mean is that we can arrive at a convention that alpha should be about 0.05 or something, or we could have arrived at a convention about the false negative probability, but it's hard to fix both at the same time. For example, we couldn't say that all of our tests will have a false negative probability of 5% and a false positive probability of 5%, because in order to make that happen, we would need a great deal of control over, for example, the size of the data sets that we were working with. And since that is true, it's just happened that many researchers find it more important to fix the size of a test than its power, for reasons we discussed last week. Briefly: they believe that a false positive is more damaging to research and society than a false negative, which is to say, it's worse to believe incorrectly that a theory is supported by evidence than to believe incorrectly that a theory is not supported by evidence.
So if you think that's true, if you think that false positives are more damaging than false negatives, then it might be more important to say: we're going to make sure the false positive rate is always 0.05, or whatever we want to set it at. And then, maybe or maybe not, we'll assess the false negative rate after that and figure out whether our study is really capable of serving as a source of good confirmatory evidence. As we just mentioned, there's a trade-off between power and size in a statistical test. Generally speaking, the greater the power of a test, which is to say the greater the probability that we will not make a false negative, the greater the probability that we will make a false positive. You can visualize that trade-off by taking a look at the graph below. Here's a graph of the power-size trade-off that I've constructed. The red distribution is the probability distribution of beta hat under the null hypothesis that beta is really 0; you can see it's centered right on 0. We're going to say that the observed beta hat is 1.5, so the other distribution is the distribution of beta hat if the actual beta is 1.5, what we observed. And this dashed line is the beta hat at which we judge the evidence to be supportive of our theoretical hypothesis. In this case that cutoff is 1.96, which implies that the sampling distribution is standard normal here; in other words, the beta hat scale and the t scale coincide in this toy example. So here's the critical value at which we decide that evidence is statistically significant and consistent with our hypothesis. Here's the probability that we see a false positive outcome: our evidence tells us it's consistent with the theoretical hypothesis, but in actual fact the true state of the world is that beta is equal to 0. Here, on the other hand, is the probability that we get a false negative. It's a much greater probability, as you can see in this case. It's the probability that we fail to reject the null hypothesis and conclude that our evidence is inconsistent with our theory, even though, in point of fact, the theory is right. And it's worth noting that the implicit alternative hypothesis here is that beta is positive: we're assuming a theory that tells us that two variables, x and y, should be positively related. In this world, where beta equals 1.5, they clearly are positively related. But given this critical value, we're going to conclude, a large proportion of the time, that there is no such relationship, or at least that the relationship is not positive and statistically significant. Now, as we make the test produce fewer false negatives, we move the critical value to the left. Here's the old critical value, and we've now shifted it over to the left. So we've made it easier to conclude that evidence is consistent with our theory, and we've decreased the probability of a false negative: this blue shaded region drops. But the probability of a false positive correspondingly rises. And the more we shift things over, and here's an even bigger shift to the left, the more we trade false positives off against false negatives.
So that's the graphical reason why there's a trade-off between power and size. OK, let's go into R now and take a look at how one would conduct a power analysis on an applied data set. As you can see, what I'm doing here is drawing a sample of size 50, setting n to 50. I create two independent variables out of the uniform distribution between 0 and 10, and then I create a dependent variable that's a linear function of x and z: y is 2x plus 1.5z plus a constant of 1 plus a normally distributed error term with mean 0 and standard deviation 1. So this is a classic, perfect CLNRM, a classical linear normal regression model. I'm just going to run that code and create that data set. There it is. Then I run a linear model on that data set, and here is a summary of that model. You can see I'm hitting pretty close to all the correct coefficients in this data set. What I might want to know is: given that a t-test uses a threshold of 1.96 for a two-tailed alpha of 0.05, what's the probability that I would get a false negative result in this case? So I'm going to use the coefficients object out of this model. As you can see, all I've done is repeat the summary command and then extract the coefficients object out of the result of that command. What I get is a 3 by 4 matrix where the columns are, respectively, the estimate, standard error, t-value, and p-value from that model. And specifically, in the next line, you can see I'm extracting one element out of that: the t-value for the intercept, 2.35. What I want to know is, suppose the distribution of betas were centered on the beta I've actually found, so suppose the true intercept were actually 0.905. What proportion of the time would I falsely fail to reject the null in that case? And so what I do is call the cumulative distribution function of the t-statistic, and say: if the mean of that t-statistic distribution is 2.35, which is the t-value we get when we treat the estimated beta as the true beta, and 1.96 is the cutoff, what I want is the area to the left of the cutoff. The degrees of freedom are n minus k because, as you may remember from last week, the number of degrees of freedom in a t-test, or in a t-distribution, is equal to n minus k, where k is the number of coefficients in the model. If I run that command, it tells me that the area under that curve is 0.34617, which is just a way of saying that this test is going to hand me a t statistic below the critical value, and thus a failure to reject, 34.6% of the time. So in other words, if my estimated intercept were the true intercept, if the true intercept were actually 0.905, and I were using a t-test with a critical value of 1.96 to determine whether that intercept was positive or not, then 35% of the time I would falsely accept the null: I would falsely fail to conclude that this evidence was consistent with a positive intercept coefficient. Now you might be asking yourself, wait a minute, you're using the t-statistic 1.96 that corresponds to a two-tailed test. That's true, I am. The reason I'm still integrating from negative infinity up to that 1.96 value is that I'm presuming that my theoretical hypothesis is that the intercept is positive.
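For reference, here is a sketch of the kind of power calculation just described. It is my reconstruction rather than the lecture's own script, so the seed, the exact estimates, and helper names like fn_rate are assumptions; shifting the t distribution via its noncentrality parameter is one way to center it on the estimated t-value.

```r
## A sketch in the spirit of the power analysis described above
## (my own reconstruction, not the lecture's script).
set.seed(123)
n <- 50
x <- runif(n, 0, 10)
z <- runif(n, 0, 10)
y <- 1 + 2 * x + 1.5 * z + rnorm(n, mean = 0, sd = 1)

mod   <- lm(y ~ x + z)
coefs <- summary(mod)$coefficients          # estimate, std. error, t value, p value
t_int <- coefs["(Intercept)", "t value"]    # treat this as the "true" t-value

## If the t statistic were really centered on t_int, how often would a draw
## fall below the 1.96 cutoff? That is the false negative rate.
fn_rate <- pt(1.96, df = n - 3, ncp = t_int)
fn_rate        # probability of a false negative
1 - fn_rate    # power
```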
And as a matter of fact, whether it's correct or not, many people use a two-tailed test to test a one-tailed hypothesis, and thus the critical t statistic is 1.96. For a one-tailed hypothesis, that actually corresponds to an alpha of 0.025, or 2.5%, but nevertheless, it happens. So in that testing environment, you would conclude that your evidence was inconsistent with a theory that said the intercept should be positive 35% of the time. And many people are shocked by how easy it is to falsely conclude that evidence is inconsistent with your theoretical hypothesis in tests that are perfectly well conducted on perfectly reasonable data sets. I think it's common for people to think: hey, alpha is 0.05, so I make a mistake 5% of the time. That's true, but it's one specific kind of mistake that you're making 5% of the time, namely false positive conclusions. It says nothing about false negatives, and it's extremely common for the false negative rate to be very high. So the power of this test is 1 minus that quantity I just calculated, roughly 65%: 65% of the time, this test will correctly detect that the intercept is positive when it really is. That's an OK power, not the best. One way we can see how the false negative rate is contingent on factors other than alpha is by, say, increasing n to 100. If I increase n to 100 and do all this again, the false negative rate drops to 6.4%, much, much lower. If we drop the sample size down to, say, 25: 21.2%. That's actually lower than when the sample was 50, but that's just due to random variation; if I run this code again, now it's 40%, 57%, geez, that's terrible, 75%. You're just getting bad draws or good draws depending on the particular sample, and the power of these tests can be really, really awful from time to time. Generally speaking, bigger samples mean better power. Another thing that influences the power of a test is the size of the coefficient. If we bump the constant up to, say, 5 instead of 1 with a sample size of 50, as you can see, our false negative rate is now minuscule, even with a sample size of 50. If we drop it down to, say, one half, 0.5: 21%, 86%, 83%, 91%, 81%, 85%, generally speaking, bad power. Thinking about a statistical test as a signal-to-noise ratio is helpful in this context. The stronger the signal is, the easier it is to distinguish from noise. So the bigger a coefficient is, the easier it's going to be to avoid false negatives, even in the presence of a relatively small sample, because the signal is strong enough to break through the noise of the error term. So as I said before, last week we did simple hypothesis testing, and I've got some examples of what one might consider simple hypotheses here. Should we conclude that evidence is consistent with a positive relationship between two variables, or maybe with a negative relationship? Or maybe we're just asking whether there's any relationship between two variables at all. These hypotheses come out of some kind of theory; that's the prediction the theory makes, and we're going to see whether the evidence bears those predictions out, or at least whether a particular sample data set is consistent with those predictions. What we're going to talk about this week is more complicated types of hypotheses.
And as I said, that means both more complex techniques for testing hypotheses, in particular various forms of simulation, including bootstrapping, which is a form of simulation-based hypothesis testing, and more complicated questions. So for example, we might ask: OK, there's a relationship between two variables, x and y. Does the size or direction of that relationship change as some third conditioning factor, z, changes? That's what's called a hypothesis of interaction. And interactive hypotheses are often a good venue for testing theories, because if a theory implies an interaction, that's the kind of specific, very detailed relationship that's hard to attribute to chance. Another more complicated question: if we have two coefficients, beta 1 and beta 2, are they the same? Do two independent variables have the same effect on a dependent variable? Why ask? This often comes up in experiments where we're looking to see whether one particular treatment is more or less effective than another. Now, you may be thinking of medical experiments, where we have a new drug and want to see whether the new drug is just as effective as the old drug, or more effective, hopefully, if you've spent a lot of R&D money on it. But this also comes up in political science, where you might ask, for example, whether a particular campaign advertisement strategy is more or less effective than some pre-existing strategy. So different techniques and different questions are on the docket. OK, so that break was a little longer than I intended. Magically, my clothes just changed and the date just advanced by two days. I feel compelled to remind my viewers that this is a free video series; the production values are correspondingly low. OK, with that disclaimer well in hand, let's move on. So last week, when we were talking about the t and F tests, one of the points we made was that, ultimately, these tests depend on the N, the normality assumption, in the CLNRM. That is to say, you need to really have a normal distribution of the error terms in order for these things to work: u is distributed normally, with mean 0 and variance sigma squared, and that is the distribution under which these tests are derived in small samples. The tests are valid in small samples precisely because we can make this assumption about the error term. In large samples, however, the tests are asymptotically valid, which is to say that, as long as u is homoscedastic, they work regardless of whatever distribution u comes out of. And that's reassuring, because what it tells us is that in large samples, we don't have to sweat the distribution of u quite so much. It also means we can leverage that result to inform our simulation-based hypothesis testing, which is really the core point of this lecture. So there is a formal proof of what's going on here, and what that formal proof demonstrates, essentially, is that certain quantities come out normally distributed in large samples, no matter what the underlying distribution of their elements is. Wow, that's a terrible description of the CLT; let's defer the formal statement for a moment. What I want to do before I get into formal proofs is just show you some examples in R to give you a more intuitive understanding of what's going on here. So let's crack open R and see what's going on.
And what you can see here is that I'm doing a Monte Carlo study to figure out whether the non-normality of u is going to affect t tests in particular. I've already told you that there's a result that, in large samples, the normality of the error, even though it's required for the proofs that support these tests, actually isn't going to matter so much. And what you're going to see in this R simulation is that, in practical settings, that result kicks in very, very quickly. So what I'm doing here is just setting up a really basic Monte Carlo study. I set the seed to some number just to make sure that we all get the same results if you run this at home. I start with a very small sample of n equals 10, and I'm going to run 5,000 simulations with n equals 10. What I'm going to do in each one of these simulations is draw two independent variables out of the uniform distribution between 0 and 10, and construct an error term, in this case a bimodal one. Just really quickly, so you can see what this looks like: let me set an n of, say, 5,000, draw a bunch of these errors, and do a histogram of u. That's what the error term looks like. Its mean is 0, but it's definitely not a normally distributed error term; it's bimodally distributed. Then I create a dependent variable y out of 2x plus 1 plus u. This is a CLRM, a classical linear regression model; it's just that the N, the normality assumption, is not true. The error is not normally distributed, although it is homoscedastic and mean 0. Then I try to recover that DGP with a model. In particular, I model y with x and z, z being an irrelevant variable. Then I conduct a t test of the statistical significance of both x and z, and in particular, I look at the p values associated with those two coefficients, which is to say I look to see whether each coefficient is statistically significant with an alpha of 0.05. You can see that down here, I'm checking the p values to see whether they're less than 0.05: I'm checking whether the observed t statistics land in the rejection region, the shaded area under the density, more or less than 5% of the time. What I'm looking for here, diagnostically, is whether these tests behave correctly. If they do, the null on the irrelevant z variable should be rejected only about 5% of the time for an alpha-0.05 test. On the other hand, x should come up significant a lot. It's not clear exactly what that rate should be, but it's a measure of the power of the test: how often do we incorrectly fail to reject the null hypothesis, how often do we incorrectly fail to conclude that the evidence supports the hypothesis? So what I'm going to do is just run this and see what happens. Here we go. Go, go, gadget R. And as you can see, I included a progress bar here so we could all see how long this would take. Perhaps I should have queued up some elevator music. Do, do, do, do, do, do. OK, there it is. So this is actually a pretty interesting result. What you're seeing is that, in terms of power, the t-test is correctly rejecting the null on x 99.9% of the time; actually, this rounds up to 1.
So there's virtually never a case where we accidentally say that x is not statistically significant at the 0.05 level. Z, on the other hand, is being incorrectly flagged as a relevant variable, which is to say we're rejecting the null hypothesis on z, about 5% of the time, 5.1% of the time. That's really good. This is with a very tiny sample and a very non-normal error; just look at that histogram. So what we can conclude here is: maybe that N part of the CLNRM is not quite so vital as our proofs would imply, and maybe we can rely on these tests a bit more than we expected. Well, maybe we're just getting lucky because it's this particular bimodal error. OK, let's try a uniformly distributed error between negative 2 and 2. If I set an n of 5,000 and plot a bunch of these draws of u, I get a flat distribution, just like that. So I'm just going to rerun this whole thing again. And while it's running, let me recap the result we just got. What we're doing is creating 5,000 data sets, running 5,000 models on those data sets, and seeing in what proportion of those 5,000 runs we incorrectly fail to reject the null on x and incorrectly reject the null on z. That's the name of the game; that's what these bottom two numbers are telling us. Well, here you go: almost perfect power and, again, a 5% rejection rate with a very highly non-normal u and a tiny sample. Well, how about an F-distributed error? Let me set an n of 5,000, draw an F-distributed u, and do a histogram of u. OK, this is a skewed but mean-zero version of the F distribution for the error, so again, highly non-normal. We're going to do 5,000 simulations with a sample of size 10 and see what happens. Come on. You can do it. You can do it, R. There we go. All right, 4.26%. OK, that's not quite on target: it looks like the false rejection rate is coming in almost a full percentage point too low, which means this test is actually even more conservative than it lets on. We could perhaps get a little more power out of the test by loosening up its alpha value; in other words, its nominal alpha is not necessarily an accurate reflection of how often it actually falsely rejects the null. But the bias is conservative in this case: it's incorrectly rejecting the null on an irrelevant variable too little of the time, which is, as these things go, pretty good news. And the power is already virtually 1, so playing around with the rejection rate to get a true alpha of 0.05 is probably not even worth our time, because we're already almost all the way there; there are very few times we're incorrectly failing to reject the null. So the bottom line here is, seemingly, that even in tiny samples, we don't really have that much of a problem when the N in the CLNRM doesn't hold. Now, that doesn't mean I couldn't create some case where there was such a problem; perhaps if I sat here long enough, I could come up with one. But it turns out there's a proof that, in larger samples, makes us completely confident in accepting the results of t and F tests, regardless of the distribution of u, so long as it has mean 0 and is uncorrelated with the regressors. In other words, we don't need to rely on the normality of u to sustain our test results in large samples. And in this case, large seems to be 10, which is a pretty small sample, actually.
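For readers following along at home, here is a compressed sketch of the kind of Monte Carlo study described above. It is my reconstruction rather than the lecture's own script, so the seed, the particular bimodal mixture used for the error, and the exact rejection rates it produces are assumptions.

```r
## A sketch of the Monte Carlo study described above (my reconstruction).
set.seed(42)
n_obs <- 10
n_sim <- 5000
reject_x <- logical(n_sim)
reject_z <- logical(n_sim)

for (i in 1:n_sim) {
  x <- runif(n_obs, 0, 10)
  z <- runif(n_obs, 0, 10)
  ## bimodal, mean-zero error: a 50/50 mixture of two normals
  u <- ifelse(runif(n_obs) < 0.5, rnorm(n_obs, -2, 0.5), rnorm(n_obs, 2, 0.5))
  y <- 1 + 2 * x + u                   # z plays no role in the DGP

  p <- summary(lm(y ~ x + z))$coefficients[, "Pr(>|t|)"]
  reject_x[i] <- p["x"] < 0.05         # power: should be high
  reject_z[i] <- p["z"] < 0.05         # size: should be near 0.05
}

mean(reject_x)   # proportion of sims where x is (correctly) significant
mean(reject_z)   # proportion of sims where z is (incorrectly) significant
```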
So let's talk about this first a little bit. Here's the proof that I'm talking about. The classical central limit theorem states that for any sequence of identically and independently distributed, or IID, random variables x 1 through x n, the quantity: the sum from i equals 1 to n of x sub i minus mu sub x, where mu sub x is the mean of x, divided by sigma sub x times the square root of n, where sigma sub x is the standard deviation of x (the square root of the variance) and n is the sample size, converges to a normal distribution with mean 0 and standard deviation 1, regardless of the distribution that x came out of. Pretty neat. Equivalently, and what I'm about to write means the same thing: x bar minus mu sub x, divided by sigma sub x over the square root of n, where the only new notation, x bar, is the sample mean of x, converges to a normal distribution with mean 0 and standard deviation 1, regardless of the distribution of x. Now, I'm not sure that really means a lot to you yet, and if it doesn't, that's not entirely surprising. But I think there's a demonstration that will make this a little clearer, one that I often use in my undergraduate statistics class. So here's what this little simulation is going to do; it's a demonstration of the classical central limit theorem. I'm going to assume that there's a population distribution of some variable x. In this case, we're ultimately interested in the distribution of beta, which is to say the coefficient in a linear regression model, which in turn depends on the distribution of u, the possibly non-normally distributed error term in the regression. The central limit theorem says that if I repeatedly take samples out of that parent distribution, and for each sample I calculate a sample mean, then as the sample size gets larger and larger, the distribution of those sample means will be normal, even if the distribution of the variable we're taking the mean of is not normal. So this parent distribution is like x in our proof. It's normal now, but suppose we make it highly non-normal. There we go. What I'm going to do is take a sample of 5 out of it and then calculate the mean of that sample. So you saw I just took a sample of 5 and then dropped down its mean. I'm going to do that again, and again. OK, now I'm going to do it, let's say, 1,000 times. Wow, that's kind of interesting: the distribution of means looks to be kind of normal. What if I do it 10,000 times? Oh, it's looking very normal now, almost perfectly normal. So what you can see is that even though x here is extremely non-normally distributed, the distribution of the sample mean of x, when the sample is of size 5, is close to normally distributed. And this gets even better when we make the sample size larger, like say a sample of 20. If I take 10,000 samples of 20, not only is the distribution very normal, but it's also narrower, so we're getting a better grasp of the population mean; there's less uncertainty because we're taking a larger sample. And that distribution looks very, very normal. In fact, I think you can even click here to fit a normal distribution, and it fits really well.
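The demonstration above uses an interactive applet, but the same idea is easy to sketch in R. In this sketch the exponential distribution stands in for the "highly non-normal" parent distribution, and the sample sizes of 5 and 20 follow the demo; everything else is an assumed illustration rather than the applet itself.

```r
## A small sketch of the central limit theorem demonstration described above.
set.seed(7)
draw_means <- function(sample_size, reps = 10000) {
  ## repeatedly sample from a skewed parent distribution and keep each sample mean
  replicate(reps, mean(rexp(sample_size, rate = 1)))
}

means_5  <- draw_means(5)
means_20 <- draw_means(20)

par(mfrow = c(1, 3))
hist(rexp(10000, rate = 1), main = "Parent distribution", xlab = "x")
hist(means_5,  main = "Means of samples of 5",  xlab = "sample mean")
hist(means_20, main = "Means of samples of 20", xlab = "sample mean")
```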
Now, what does this have to do with regression? You might be saying to yourself: hey, that central limit theorem thing is fine, but we're not taking means, we're estimating betas and conducting t tests. Well, here's the connection to hypothesis testing and the distribution of beta. It turns out that the t statistic looks a lot like the quantity in the classical central limit theorem formula. Let's look at the z statistic, or t statistic, again; they work out to be pretty similar. Just taking the z formula: z is beta hat minus beta 0, where beta 0 is the null beta, divided by the square root of the variance of beta hat. That is very, very similar to the classical central limit theorem formula. You might be saying to yourself: OK, I can sort of see that beta 0 is the mean of beta hat under the null, so that corresponds to mu; all right, I buy that. And the square root of the variance is the standard deviation of beta hat, so that corresponds to the sigma over root n part. But then you might be saying to yourself, wait a minute, how is beta hat a mean? Well, for reasons that are probably beyond the scope of what we really need to go into, one can conceive of beta hat as a kind of mean, a weighted average of the data, and because of that, this formula applies to its distribution. As n goes to infinity, we expect the distribution of the z statistic to approach the normal distribution, regardless of the underlying distribution of u, so long as the other assumptions of the classical linear regression model hold. So the bottom line, if you're trying to extract a practical lesson out of this demonstration: we don't need the normal part of the CLNRM if our sample size is large enough. We can conduct t and F tests with impunity, regardless of the normality of u, as long as our sample is large enough, which feels good and is one less thing to worry about. Not only that, but we're going to be able to leverage this normality result into getting us some interesting things: it's going to enable us to do some simulation-based hypothesis testing, which is pretty neat, and that's going to be the subject of the next segment. OK, so let's talk a little bit about simulation-based hypothesis testing. Everything we've done up till now has been a test based on an analytical result, which is just a way of saying that I showed you a proof that something was distributed in some way, like the t statistic being distributed according to the t distribution. That enables us to plug the results of linear regressions into a formula based on that t distribution, and to say how the statistic will be distributed under the null hypothesis that there's no relationship between x and y. That, in turn, enables us to say: we need evidence of this particular size, or this particular certainty, in order to conclude that the evidence supports a relationship between y and x, because evidence like this wouldn't occur very often if there weren't really a relationship between x and y. That's the implicit move being made there. But it's not always the case that one can prove that such a formula exists, and even when you can, it's not always easy, efficient, or fast to do so. And that's where simulations come in.
Simulation-based hypothesis testing is useful when the true distribution of the test statistic is unknown, either because it can't be known or because you just don't know how to derive it, or when only asymptotic results are available and n is small. So if you're worried that some formula may not apply because an assumption might be broken in your test, or in particular because your sample size might be too small, maybe it's the case that we can do some simulation-based hypothesis testing, and then we don't really have to worry about whether the conditions of that formula apply or whether we can invoke some proof. There are lots of different ways of thinking about how to use simulation to test hypotheses, and we're going to talk about several of them in this lecture. So I want to start off with an applied example. I want you to consider a really basic model: y equals beta 0 plus beta 1 x plus beta 2 z, and I should add that this is not an error-free model; there's an error term included as well. Suppose that the data come out of an experiment, and x and z are dichotomous, dummy 0-or-1 variables. So we've basically got three conditions in this experiment: the control, treatment x, and treatment z. And what we're trying to figure out is: is treatment x different from treatment z? Now, there are lots of applications of this in, say, medical science. Someone comes up with a new drug for some disease, there is an existing treatment or drug for that disease, and you want to find out whether you can advertise that this drug is better than the old drugs. So you run an experiment, and then you compare, in the data, whether treatment response was better with the old drug or with the new drug. And if you've spent $2 billion on this drug, you hope the new one's better. But there are also some political science applications for this. Just to give a trivial example, we do treatments as well. Whenever a consultant recommends a particular campaign strategy, they're implicitly saying: look, there are lots of things you could do, and I have determined that this is the one that will get you the optimal result in terms of vote share, for example, which is typically what you're trying to maximize. And the way one might find that out is by conducting formal experiments, field experiments, gathering data, and then running a model like this to decide whether there's really any difference between these two treatments. Now, one thing I want to note is that, naively, what some people try to do is ask: is beta 1 different from 0, and is beta 2 different from 0? They test for statistical significance and then compare the significance tests for the two. Like, oh, beta 1 had a p of 0.05 and beta 2 had a p of 0.0001, so beta 2 must be bigger, or beta 2 must be more significant. That's not a very good way to do it, for a lot of reasons. For one thing, what we're interested in is not whether these two things are different from 0, but whether they're different from each other, and that is a different question. There are lots of articles written about this particular issue. But for the time being, it suffices to say that what we're interested in doing is testing the hypothesis that these two coefficients are not the same. So, just writing down the hypothesis test here, we've got a null that beta 1 is equal to beta 2 and an alternative that beta 1 is not equal to beta 2. That's what we're trying to test.
Now, there are lots of ways of testing this hypothesis, and some of them are analytic. I've got an R script prepared where we do this, and I want to start off by looking at the analytic methods, then transition to using simulation methods and compare the two. Hopefully they'll give us similar answers; if they give us very different answers, we might have a problem. All right, here's what I'm going to do to compare these different methods of hypothesis testing. I'm going to generate a model where, again, I've got a sample size of 100. It's an experiment, and sample sizes are usually a bit smaller in experiments, not always, but often. I've got my two treatment variables, x and z, and these are non-overlapping treatments: there are 33 people in the control, 33 people in treatment x, and 34 people in treatment z, and nobody gets both treatments. So if you got treatment x, you didn't get treatment z, and vice versa. The outcome variable y is 2x plus 1.5z plus a normally distributed error term. So what we're saying is that x actually has a bigger effect than z, and we want to see if we can detect that. So I'm going to run a little model here: I created the data and ran the model. If I just do a summary of that model, you can see it's actually recovering those beta values on x and z pretty well, and in our estimates, x is bigger than z. But with random variation, we might have gotten an x that was closer to z; we might have gotten an x of, say, 1.8 and a z of, say, 1.6, and you wouldn't necessarily have known whether those two things were different. Even with these estimates, the standard errors are fairly large, 0.25. So you might be wondering: do the distributions of beta x and beta z overlap enough that we really can't tell the two apart? Well, I've got lots of ways of figuring that out, and the first way is a t-test. Let me go back to my OneNote here for a second and talk a bit about how I would do this with a t-test. The canonical t formula, as you remember, is t equals beta hat minus the null value, over the standard error. Well, in this particular case, we're interested in the difference between two coefficients: is that difference statistically significantly different from 0? So the quantity of interest is beta 1 hat minus beta 2 hat, and I'm going to plug that quantity of interest into every part of the formula where there used to be just a single beta. This bit used to be the observed beta hat; now, playing the role of the observed beta hat is the difference between the two observed betas. And before I needed the standard error of the beta; now I'm going to need the standard error of the difference between the two betas. (And no, this shouldn't also be divided by the square root of n; the standard error already takes care of that.) All right, so I'm just going to conduct a t test using that formula and see what happens. And in R, it's going to be quite easy to do that. The only trick is that I've got to get the variance of the difference between two estimates instead of just the variance of a single estimate.
And so I need to think about what the variance of the difference between two random variables is. Well, as it turns out, that's actually fairly easy; you just have to know an identity. If I've got two random variables, a and b, the variance of a plus b is defined to be the variance of a, plus the variance of b, plus 2 times the covariance between a and b. And while we're on the subject, the variance of a minus b is equal to the variance of a, plus the variance of b, minus 2 times the covariance between a and b. So I can use this identity to figure out what the variance of the difference between these two estimates is: what I need to know is the variance of beta 1 hat, the variance of beta 2 hat, and 2 times the covariance between those two. Where do I get those figures? From the variance-covariance matrix of the model that I just ran. So here's what I've done: I've extracted the VCV out of the model I just ran, and I think I can display this for you. There it is. So here's the variance of the beta on x, here's the variance of the beta on z, and here's the covariance between the two. So what I'm going to do is take the square root of VCV[2, 2] plus VCV[3, 3] minus 2 times VCV[2, 3]; that is the standard error of the difference between beta hat x and beta hat z. On top goes the difference between the two betas that I just extracted. So when I execute that formula and print the result, I get a t-statistic of 2.829. Now, what's the critical value for a t-statistic with a two-tailed alpha of 0.05? Well, it's about 1.96, so we already know that the difference between these two coefficients is statistically significant. But I could also get a p-value, if I'm exceptionally lazy, or maybe I'm writing a programming package or something. In order to get a p-value, what I want to know is the area under the t-distribution from my observed t-statistic out to infinity in the relevant tail. In this case, the difference between x and z is positive, so I'm looking at the right-hand tail, and I've got 97 degrees of freedom; n minus k is 97. So when I calculate this t-statistic, I can also calculate a p-value with it, and when I do that, bam, there it is: 0.0056. This is an extremely statistically significant difference at this sample size. So I conducted this test with a standard formula-based t-test approach. Everything works fine, nothing wrong with that, perfectly reasonable. I could also use an F-test, and that's approach two. I don't think I need to belabor this too much, but what I'm going to do is create a variable that indicates "x or z" and call it cap X. What this is saying is: suppose these two treatments aren't really two treatments; they're just one treatment with the same effect, the same beta. Model 2 here is the restricted model. So here's the unrestricted model, where I let the two treatments have different effects, and here's the restricted model, where I say they have the same effect. I want to conduct an F-test, using the anova function, to see whether the unrestricted model beats the restricted model, in other words, whether the restrictions are statistically acceptable. And what do we find? Well, restricting x to equal z, the F-test tells us that the difference between the two models is statistically significant, so the restrictions are not warranted. The restrictions are not supported.
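Here is a sketch of those two analytic tests on simulated data like the experiment described above. It is my own reconstruction rather than the lecture's script, so the seed, and therefore the exact t-statistic and p-value it produces, will differ from the numbers quoted in the lecture.

```r
## A sketch of the two analytic tests just described (my reconstruction).
set.seed(1)
n <- 100
group <- rep(c("control", "x", "z"), times = c(33, 33, 34))
x <- as.numeric(group == "x")          # treatment x dummy
z <- as.numeric(group == "z")          # treatment z dummy
y <- 2 * x + 1.5 * z + rnorm(n)

mod <- lm(y ~ x + z)
b   <- coef(mod)
V   <- vcov(mod)                       # variance-covariance matrix of beta hat

## t-test: (beta_x - beta_z) / se(beta_x - beta_z), using
## var(a - b) = var(a) + var(b) - 2 * cov(a, b)
se_diff <- sqrt(V["x", "x"] + V["z", "z"] - 2 * V["x", "z"])
t_stat  <- (b["x"] - b["z"]) / se_diff
p_val   <- 1 - pt(t_stat, df = n - 3)  # right-hand tail, 97 degrees of freedom
c(t_stat, p_val)

## F-test: restrict beta_x = beta_z by using the combined "x or z" indicator
mod_restricted <- lm(y ~ I(x + z))
anova(mod_restricted, mod)             # restricted vs. unrestricted model
```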
Remember that the null hypothesis in an F-test depends on the exact F-test you're running, but in this particular case, the null hypothesis is that beta x minus beta z is zero, or that beta x equals beta z. And that restriction is not borne out. And actually, this is where things get kind of freaky: check that out, everything is exactly the same. This is just another incidental demonstration that F-tests and t-tests are equivalent: if you're testing the same thing with them, they give you the exact same numerical answer. So I could have done this with an F-test. That's, again, a formula-based, perfectly reasonable thing to do. But I could also do it with simulation, and this is kind of the new content. So what I want to do is show you how I would do this with simulation. Well, here's how: we could directly simulate the quantity of interest, beta 1 hat minus beta 2 hat, out of the distribution of the beta hat coefficients. So here's a well-known fact, Sunny Jim: beta hat is asymptotically distributed normally, with mean beta and variance-covariance matrix omega, under the CLRM, as a consequence of the classical central limit theorem. What do I mean by that? If OLS is an appropriately specified model for the underlying DGP, then the distribution of beta hats I would get out of repeated sampling, from running this model over and over on different samples that come out of the same DGP, will have a mean of beta and some variance-covariance matrix. I haven't really talked about where omega comes from yet, but I will in a moment. For now, it suffices to say that the estimated variance-covariance matrix of a regression can approximate omega: the VCV of beta hat is an approximation of omega. So I can use my empirical results to computationally sample directly out of the distribution of beta hat, and thus get a set of samples from the distribution of beta 1 hat minus beta 2 hat. I'm going to say all this again in a slightly different way, and then I'm going to do it in R, and hopefully it will click. So, first thing: beta hat is distributed according to a multivariate normal distribution, which takes a vector of means and a variance-covariance matrix, sigma. And this distribution basically just looks like the normal distribution in two dimensions. So the first plot here is the distribution of two variables, x sub 1 and x sub 2, under a multivariate normal; their standard deviations are one half, and the correlation between the two, rho, is 0, so their covariance in this case is 0. And you remember the normal distribution, you know this guy, right? In one dimension it looks like this: some variable x on one axis, and on the y-axis the probability density, f of x. The bivariate normal is just that same thing over two dimensions, so you get a two-dimensional hill instead of a one-dimensional hill. Now, the more correlated x1 and x2 are with each other, in other words, the greater the covariance between the two, the more that hill flattens into a ridge. So at a rho of 0.4, you can see that the hill, which used to be a fairly uniform bump, is now more elongated, sort of a wedge, I guess I would describe it.
And when the correlation reaches 0.8, still a positive correlation in this case, it's even more elongated. In the limit where rho equals 1, where these things are perfectly correlated, what we would have, if you can imagine, let me draw it: with x1 and x2 here, if these were perfectly positively correlated, we'd have something like a ridge sitting exactly on the x1 equals x2 line, with f of x normally distributed only along that line. So when I say a rho of 0.4, rho is the correlation coefficient, which you probably learned about in your earlier classes; correlation coefficients can be translated into covariances. The colloquial interpretation of rho is that it's the proportion of the available variance in the two variables that covaries. I don't want to belabor that too much; it was just much more cumbersome to write out the covariances, and they don't necessarily mean anything on their own, whereas rho is a more communicative statistic. But colloquially, again, the bigger the covariance term in your VCV, the greater rho will be, and at the maximum, all the variance in x1 might be correlated with all the variance in x2, in which case rho would be 1. But I should note that regression doesn't give you a correlation matrix; what it gives you is a covariance matrix. And that's actually good, because we are not going to want correlations; we're going to want covariances when we simulate out of the distribution of beta hat. So here's how I would conduct this simulation. I know what the distribution of beta hat looks like: it has a mean of beta, and it has a variance-covariance matrix of omega. What I'm going to do is run a regression and calculate beta hat, which is my estimate of beta, and the VCV omega hat, which is my estimate of omega. I'm going to draw many samples out of a multivariate normal distribution centered on beta hat, with covariance matrix omega hat. I'm going to calculate beta 1 hat minus beta 2 hat for each draw, and then I'm going to look at the distribution of those differences. In particular, I'm going to see whether the 95% confidence interval includes a difference of 0. If it doesn't, that means there's a statistically significant difference between these two coefficients, because the distribution of the difference doesn't include 0, or includes it only in a very far tail. So let's take a look at an R example here. I'm going to use the mvtnorm package, which just allows me to simulate out of a multivariate normal distribution given a mean vector and a variance-covariance matrix. And I've constructed a little example to show you what this looks like. What I've done here is draw 1,000 draws out of the multivariate normal distribution with a mean equal to the model coefficients, so there's a vector of means corresponding to each one of the coefficients, and there are three of them: the constant, x, and z. And the variance-covariance matrix is vcv.model, which is just the variance-covariance matrix I got out of the model. You can see, if I pull up rmvnorm, that this is how I draw random numbers out of the multivariate normal distribution: it asks for the number of draws to take, what the mean of those draws should be, and what the variance-covariance matrix should look like. There are lots of other arguments it can take, but you're safe to leave those at the defaults.
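Here is a sketch of that procedure end to end, assuming the mvtnorm package is installed. It rebuilds the toy experiment from the earlier sketch, so the exact draws, density plot, and interval endpoints are illustrative rather than the lecture's own numbers.

```r
## A sketch of the multivariate-normal simulation just described.
library(mvtnorm)
set.seed(1)
n <- 100
x <- rep(c(0, 1, 0), times = c(33, 33, 34))
z <- rep(c(0, 0, 1), times = c(33, 33, 34))
y <- 2 * x + 1.5 * z + rnorm(n)
mod <- lm(y ~ x + z)

## Draw 1,000 betas from the asymptotic distribution of beta hat:
## multivariate normal with mean beta hat and the model's VCV matrix.
draws <- rmvnorm(1000, mean = coef(mod), sigma = vcov(mod))
diffs <- draws[, 2] - draws[, 3]     # columns follow coef order: intercept, x, z

plot(density(diffs), main = "Simulated distribution of beta_x - beta_z")
quantile(diffs, c(0.025, 0.975))     # does the 95% interval include 0?
```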
So I just draw 1,000 samples out of that distribution, and then I want the difference between beta 2 hat and beta 3 hat. Actually, I can show you: here's what the draws look like. I'm just drawing lots and lots of betas out of the asymptotic distribution of beta hat, assuming that its true mean is beta hat and its true variance-covariance matrix is omega hat. Then what I'm going to say is: all right, let's plot the density of the differences between this coefficient and this coefficient. You can see that, across all these draws, there is a pretty consistent difference between the two. So what's the density of that difference? Well, here it is. And you can see that 0 is included in the density, but the p-value associated with 0 would be very, very small. Which is to say: if the null were true, think of taking this hill here and sliding it over so it's centered on 0, and then locating our observed difference on that distribution, which is analogous to what we're doing right here, we would very rarely see a difference of this size. And I think on average the difference is about 0.5; we would very rarely see a difference of that size if the true difference were 0. An analogous way of looking at it is: if I plot the density of beta x minus beta z under my results, and 0 is not a frequently occurring outcome under this distribution, then for analogous reasons I can conclude that this data probably didn't come out of a null distribution. One easy way of doing this is to take a 95% confidence interval out of this distribution of differences and see whether it includes 0, which is analogous to constructing a 95% confidence interval under the t-distribution when I'm doing a standard t-test for a single coefficient. And what you can see is that 0 is not included in this interval. That's good news: it means this difference is probably positive. And in fact, since I created this data, I know it's positive. But if I didn't know what the data generating process really was, then I would say there's a good chance that this evidence is not very supportive of the null hypothesis of no difference, and is pretty supportive of the alternative hypothesis of a positive difference. Now, that's not the only way I can simulate results. I can also simulate results using what's called nonparametric bootstrapping, which is an entirely different way of thinking about simulation, although the steps look kind of similar. So in the previous set of simulations, what I did was say: because of the central limit theorem, I know something about how beta hat will be distributed, and I'm going to leverage that knowledge to simulate out of the asymptotic distribution of beta hat using my regression results. But suppose that I didn't want to call on the central limit theorem, or I couldn't. What would I do then? How could I use simulation to get some results? Well, one way of doing that is nonparametric bootstrapping. Nonparametric bootstrapping tries to simulate the sampling process out of the data set itself, in order to figure out what the distribution of beta hat would be if you repeatedly sampled the world, or history, or whatever the relevant sampling frame is. And what it says is: OK, my data set is the best representation of reality that I have, so I'm going to treat it as if it were reality.
And so I'm going to draw samples of size n with replacement out of the data set. So imagine this is the situation. This is what you've got here. So I've got a data set. Here's observation 1. Here's observation 2. Here's observation 3. Here's observation 4. Here's observation 5. And this data set is the best view of the world that I've got. What I'm going to do is create a bunch of bootstrap data sets. So I'm going to create bootstrap data set 1, bootstrap data set 2, bootstrap data set 3, and so on. I'm going to create a lot of these. And what I'm going to do is say, well, if I had sampled out of the world, I would get different samples depending on which exact observations I picked. And so, for example, sometimes I get three twice, and four once, and four again, and maybe one. And then the next time I'd get five, one, two, two, five. And then the third time I'd get, like, one, two, four, three, four, and so on. What I'm doing is I'm sampling out of the data set, so this here is the data set, with replacement. If I sampled without replacement, I'd just always get a data set of 1, 2, 3, 4, 5. So replacement is necessary. And in this way, I have an equal chance of selecting each of the observations in each of my bootstrap samples. And what this does is it creates a distribution of sample data sets. It creates a bunch of sample data sets. It says, if I had sampled out of the world a lot of times, I would get a lot of data sets that kind of look like these bootstrap samples. So what I do is then I take each one of these sample data sets, and I run my model and calculate the difference in the coefficients for that model in that data set, in that fake sampled data set. Or I guess it's not a fake data set. It's a bootstrap data set constructed out of real observations. But it's obviously not the real data set, because I have multiple copies of some of the sample observations. So I do this a lot of times for a lot of bootstrap samples, maybe 1,000 bootstrap samples, maybe 50,000 bootstrap samples. It just depends on how accurate you need your answers to be. Then for each one of these samples, I take the difference in the coefficients that I have, and I look at the density of those differences. And the density of those differences tells me what I would expect the sampling distribution of those differences to look like. And then I can conduct some kind of hypothesis test, again using confidence intervals, based on that simulated distribution. So let's do it and see what happens. So what I've done is create a matrix that has 1,000 rows and three columns. And actually, let me go ahead and set the seed so that we always get the same results here. I'm going to create a matrix with 1,000 rows and three columns. The three columns correspond to my three variables in my regression. The 1,000 rows correspond to the 1,000 bootstrap samples that I'm going to generate. So for i in 1 to 1,000, I'm going to use the sample command to draw a sample of length(y), so that's just n, out of the sample data set. And I'm going to sample with replacement. So actually, this is a little bit tricky. You see that v is actually a sequence from 1 to n, that is, 1 to length(y). And what I'm doing is I'm sampling out of that sequence, and then I'm grabbing a bootstrap data sample by picking the rows of the data set that correspond to my sampled sequence. So it's a bit indirect, but it works the same way.
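Here's a rough sketch of what that resampling loop could look like in R, assuming vectors y, x, and z are sitting in memory; again, the object names are placeholders, not my literal script.

```r
# Sketch: nonparametric bootstrap of the regression coefficients.
# Assumes y, x, and z are vectors of the same length; names are illustrative.
set.seed(1234)                                    # arbitrary seed
n <- length(y)
n.boot <- 1000
beta.boot <- matrix(NA, nrow = n.boot, ncol = 3)  # columns: constant, x, z

for (i in 1:n.boot) {
  v <- sample(1:n, size = n, replace = TRUE)      # resample row indices with replacement
  boot.model <- lm(y[v] ~ x[v] + z[v])            # refit the model on the bootstrap sample
  beta.boot[i, ] <- coef(boot.model)              # store that sample's coefficients
}

# bootstrap distribution of the difference between the x and z coefficients
diff.boot <- beta.boot[, 2] - beta.boot[, 3]
quantile(diff.boot, probs = c(0.025, 0.975))      # does the 95% interval include 0?
```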
We can talk a little bit more about that in class, if you like. Then what I do is run a model on my sample data set, my bootstrap sample of the sample data set, and store the coefficients in my beta matrix. This is going to take a while, so let me just do that. Actually, oh, it's already done, so I guess I didn't need my text progress bar tracker. Then what I'm going to do is look at the mean. So actually, let me just show you here: head(beta.boot). So each one of the rows corresponds to a bootstrap sample result, and each one of the columns corresponds to a variable. This is the constant, x, and z. And what I'm going to do is say, well, first of all, what was the mean answer? That's going to be my estimate of beta. And then, what was the 95% confidence interval of this distribution of the difference between the two? And that's going to be my distribution of the difference between the treatment effect of x and z. So if I construct this confidence interval, what do I have? Well, first of all, notice that this confidence interval, 0.22 to 1.15, is very similar to this confidence interval, 0.21 to 1.15. That's not a mistake. All these results give you very similar answers. And you'll notice that the difference between column 2 and column 3 is statistically significant, which is to say we can say that there's a positive difference, and that x is more efficacious than z at raising y, whatever that may be. So in conclusion: four different methods, two analytical and two simulation-based, and they all give you very, very similar answers, although there are some subtle differences that we can discuss. And what's handy is that when the first two options are not available to you, the second two options, the simulation-based options, are often still available to you. And that's going to enable you to do things that you wouldn't be able to do if you were bound to using the formulas without being able to be more flexible. So these four approaches, the two analytic approaches and then the two simulation approaches, are the basic tools we're going to use to test hypotheses of interaction. So you probably have heard something about interaction terms previous to this. And if you haven't, now is a good time to start. What we mean when we say that two variables have an interactive relationship is that there's some relationship, correlation, or causal mechanism linking x to y, but the nature of that relationship depends on a third variable, z. So the way I would write this in a linear model context is something like this: y is a function of a constant, x, z, and a product term, x times z. And so what this means is, when we consider the marginal effect of x on y, what we're going to get is beta 1 plus beta p times z. And what this tells us is, hey, look, the slope of the relationship between y and x is going to depend on z. Now, it's extremely common for interaction hypotheses to come up in the literature. It's become more and more common to think theoretically of relationships being interlocked in this way. And consequently, it's increasingly likely that you'll see this kind of a test, or something even more complicated, in applied empirical work. But just to give you one small for instance: there are a lot of people who think that there's a relationship between ethnic diversity in a state and the amount of violent political dissent in that state. And a lot of ink has been spilled about this question.
Some people say it has to do with the extent to which each ethnic group is empowered relative to its size in the state. That's roughly Cederman's view. There's a view that it's a product of just having diversity inside the state, which is to say the more ethnic diversity you have, regardless of the power structure, the more one would expect there to be dissent just because of cultural clashes. And there's the Fearon and Laitin view, where ethnic communities provide information to each other about each other inside of the community, but not between communities. And so cooperation is reinforced inside of communities, but not necessarily between communities. So for all these reasons, maybe there's a link between some version of ethnic fractionalization or ethnic diversity and violent political dissent. But it might be the case that that relationship is a function of the political institutions in the state. So in democracies, for example, democracies have all sorts of power sharing arrangements and discontent release valves. You can appeal to courts, which are generally independent. You can lobby legislators who are independent from each other and independent from the executive, and so on. So maybe in democracies, there's no such relationship. But in autocracies, there is such a relationship. If we were going to model that empirically, or we were going to test to see whether that hypothesis were true, what we would do is say, OK, violent political dissent is our y here. And that's a function of, all right, well, ethnic diversity and the political institutions you have in the state. But the relationship between ethnic diversity and violent political dissent is contingent on the nature of the state, on the institutions created by that state. So I just wrote the marginal effect of x on y. Well, actually, I already wrote that. So that's the name of the game. We're trying to figure out whether this relationship exists and whether it's contingent on other things. So there are lots of questions we might be interested in asking and answering in this theoretical situation and with our empirical model. One of them is: what's the relationship between y and x, and how does that vary with z? That's the most basic statement of the question. Is there a marginal relationship between y and x? Does x affect y? Or is x correlated with y? And in a situation of interaction, the answer to that question is going to be contingent on z. So we need a way of looking at whether the relationship between x and y is statistically and substantively significant for different values of z. That's what I'm categorizing down here as question two. I'm looking to see whether dy/dx is different for different values of z. Another possible question of interest is just: is it the case that z impacts dy/dx? Is it the case that the slope between y and x changes as z changes? That question, as it turns out, can be answered more simply by just looking at the product term coefficient. If the product term coefficient is statistically significant, the slope dy/dx does move as z moves. That's a question that's easily answered just by looking at regression coefficients. What's not so easily answered, and what's arguably the more important question of interest, is: is x related to y, and how does that relationship change as z changes?
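To fix ideas, the model I have in mind would look something like this in R, with hypothetical variable names (dissent, diversity, democracy, state.data) standing in for whatever measures and data set you actually use.

```r
# Sketch: an interaction model in which the effect of ethnic diversity on
# violent political dissent depends on regime type. Names are hypothetical.
model <- lm(dissent ~ diversity + democracy + diversity:democracy,
            data = state.data)
summary(model)

# The marginal effect of diversity is then:
#   d(dissent)/d(diversity) = beta_diversity + beta_product * democracy
```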
So it's that question that's probably the more important one for a lot of theoretical hypothesis tests, but it's certainly the one that requires a little bit more subtle approach to hypothesis testing. And it's going to make use of some of the methods that we've laid out. So one thing we could do, for example, is take a look at analytically calculating the marginal effect and its standard error, and using that analytical calculation to determine whether dy/dx is statistically significant. So what do I mean? Well, OK, so recall, this is the quantity of interest: does y change as x changes? And if we're going to test a hypothesis about this using the t formula, you know that in the denominator we're going to need this, because the t formula is going to be dy/dx minus the null over the square root of the variance of the relevant quantity, dy/dx. So we're going to need to know that variance. And that's a little tricky. Actually, I can write it right here. This is equivalent to asking, OK, what's the variance of beta 1 hat plus beta p hat times z? And again, there are identities we can use here. You can see we're going to use the addition identity. This is equal to the variance of beta 1 hat, plus the variance of beta p hat times z, plus 2 times the covariance of beta 1 hat and beta p hat times z. Now, the problem is that there's a z stuck in there. If we consider z a non-random constant, in other words fixed, non-stochastic, then we can use a further identity and say that the variance of beta p hat times z is z squared times the variance of beta p hat. And that's just an identity: for the product of a constant term and a random term, the variance of that product is equal to the square of the constant term times the variance of the random term. So coming up here and writing this down: the variance of a times x equals a squared times the variance of x, when a is constant and x is random. That's just an identity. So we've invoked that identity. And similarly, we've got a covariance down here. We can pull the z out of there too, but in a covariance, we don't need to square it. So there is the analytical identity for the variance of dy/dx. So now all we need to do is use this to calculate the relevant variance, and/or calculate the relevant t-statistic or confidence interval, depending on how we want to approach this. And one thing I want to point out, which is really important, is that this variance is dependent on the value of z, which means that a t-statistic for statistical significance is going to depend on z. The effect might be statistically significant in some regions and not in others. And it also means we're not going to just conduct one test. We're going to have to conduct what amounts to multiple tests over the range of z. What do I mean by that? Well, let's take a look at an R example. So I'm going to make some data, x and z. Actually, let me just go ahead and remove everything that's in the workspace already. And also, I'm going to set a seed so that we all have the same data set. And I'm going to make some data. And then I'm going to create a model with a product term, xz, which is just x times z. And the beta on that product term is going to be negative 0.485. So then I'm going to save this as a data file. So I'm going to go ahead and just execute this. Now I can clear the memory again and then load that data file into my R session from scratch. Let me clear this out here. And if I run a linear regression on this data set, what do I get?
Well, I get that, look, the product term between x and z is statistically significant. And just about as we set it to be, it's about 0.44. This is a fairly small data set, only 50 observations, and we're estimating four parameters, a constant and three variables, so we expect there to be some degree of variation here. OK, so the coefficient on x times z is statistically significant. So we know that the relationship between x and y does depend on z. We know that, yes, this product term cannot be set to 0. And that means that our dy/dx is going to change as z changes. All right, so in other words, the significance of that term is indicative of the fact that we have, not a problem, but prima facie evidence that interaction is going on. And we need to think about that as we look at the relationship between, say, x and y. So what I want to do now is say, all right, how is the relationship of x and y dependent on z? Well, I'm going to extract the coefficients and variance-covariance matrix out of the model. And then I'm going to create a two-panel plot. So let's see, how would I best do this? Perhaps I should get rid of my face for a little while so I have room for this two-panel plot. So I'm going to create a two-panel plot. And in one panel, what I'm going to do is just plot the derivative values. So what I'm going to do is pick a bunch of values of z from negative 10 to 10, which is, let's see, is that the range of z? Yes, it is. The minimum is negative 10. The maximum is 10. So I'm going to pick a sequence of z's at which to calculate my dy/dx. So z.fits is just a bunch of numbers from negative 10 to 10. Now, in the first graph, I'm going to plot dy/dx for all these different values of z. So I'm creating dy/dx, which is just the beta on x plus z times the beta on the product term. And then I'm going to plot that as a line. Bam. Take a look. dy/dx, the relationship between y and x, is contingent upon the value of z. That makes sense. We set it up to be that way. That's the whole point. That's good news. But we want to test for statistical significance of this quantity. We want to know, does x actually have an effect on y? And so in order to do that, what I need to do is construct a confidence interval around this line. And to do that, I'm going to figure out the variance of this relationship using the formula I just wrote in our notes. So here's the variance of the beta on x. Here's the variance of the beta on the product term, times z squared. Here's 2 times z times the covariance between the product term beta and the beta on x. I take the square root of all of that, and I get the standard error. So then to construct a confidence interval, I take the derivative line here and both add and subtract t sub alpha, or 1.96, which is the critical value for a two-tailed test with alpha of 0.05, times the standard error. And then I'm going to add that to my plot. Oh, nice. You've got to include that in there, don't you? Come on, you. Go, go, gadget, standard error. And I'm going to add a zero line here, just plot a little zero line in. What have we learned? Well, we've learned that in this little sample data set, x and y are related, in this case positively related, for most values of z. But at z of about 5 or so, the relationship becomes statistically insignificant. That is to say, it becomes indistinguishable from a zero relationship at that point.
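For reference, here's a compact sketch of that plotting code, assuming the model was fit as model <- lm(y ~ x + z + x:z) so that the coefficient names are "x" and "x:z"; treat it as a schematic version of what's on screen, not the literal script.

```r
# Sketch: marginal effect of x on y across z, with an analytical 95% interval.
beta.hat <- coef(model)
vcv <- vcov(model)

z.fits <- seq(-10, 10, by = 0.01)                   # values of z to evaluate
dy.dx <- beta.hat["x"] + beta.hat["x:z"] * z.fits   # beta_x + beta_xz * z

# var(dy/dx) = var(b_x) + z^2 * var(b_xz) + 2 * z * cov(b_x, b_xz)
se.dy.dx <- sqrt(vcv["x", "x"] + z.fits^2 * vcv["x:z", "x:z"] +
                 2 * z.fits * vcv["x", "x:z"])

plot(z.fits, dy.dx, type = "l", xlab = "z", ylab = "dy/dx")
lines(z.fits, dy.dx + 1.96 * se.dy.dx, lty = 2)     # upper 95% bound
lines(z.fits, dy.dx - 1.96 * se.dy.dx, lty = 2)     # lower 95% bound
abline(h = 0, lty = 3)                              # zero line for reference
```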
So what we can conclude is that x makes y bigger, except when z is big, right? So putting it back into our original example, maybe ethnic fractionalization makes violent political dissent more likely, except when democracy levels are high, something like that. So you can see how useful this tool is. We're able to, in a very visually compelling way (and now I've put my face back in, which is also visually compelling), see the relationship between x and y and how that relates to z, and simultaneously incorporate information about whether this relationship is statistically significant or not. So I should say, this is the test proposed by Brambor, Clark, and Golder in their Political Analysis paper. It's become a very influential paper in our discipline, and it's the canonical way of testing hypotheses of interaction, which, again, are hypotheses that the relationship between y and x is contingent on z. And these plots are something they created in order to conduct these tests, and I would say it's fair to say this is the standard way of testing hypotheses of interaction in the discipline now. But it's not the only way. In other words, they use this analytical calculation using the asymptotic formula, the t formula, and the variance identities that I showed you. But I could have also gotten this line using bootstrapping. I could have bootstrapped the standard error. How would I do that? Well, going back into my R code here, actually, I guess what I want to do first is draw out of the asymptotic distribution of betas with a fixed sigma. So up here I used the variance formula. What I also could have done is say, well, I know beta hat is distributed according to an asymptotic normal distribution with variance-covariance matrix omega. And so what I can do is say, all right, I want to use the multivariate normal distribution and draw 1,000 sample betas out of that distribution according to my results. So I set the mean of the rmvnorm call to the coefficients I got out of my regression, and the variance-covariance argument to omega hat, or the VCV matrix. Then I draw just a bunch of betas out of there. And in particular, I draw the betas, and then I save the dy/dx value for a whole bunch of values of z.fits for each one of those draws. So I've got 1,000 draws. That's this giant matrix. This is a 1,000-row-by-one-column matrix of beta x values. This is the vector of z's. And this is a column vector of the product term beta draws out of the asymptotic distribution. And so this sum is a bunch of draws of the dy/dx value. So I'm going to come in here, do that, and plot the result. Now what I'm going to do is use the draws I just took out of the asymptotic distribution in order to figure out the standard error, or the confidence interval, that goes around this line. And so what I do is this for every value of z. So if you look at the dimension of dy.dx.draws, it's 2,000 by 1,000. Those 2,000 columns each correspond to a different value of z.fits, and the 1,000 rows correspond to the 1,000 draws I took out of the asymptotic distribution of beta hat.
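Here's a rough sketch of that draw-and-compute step, under the same illustrative assumptions as before (a model fit as lm(y ~ x + z + x:z) and placeholder object names):

```r
# Sketch: simulate betas from the asymptotic distribution, then compute dy/dx
# for every draw at every value of z. Rows are draws; columns are z values.
library(mvtnorm)

set.seed(1234)
beta.draws <- rmvnorm(n = 1000, mean = coef(model), sigma = vcov(model))
z.fits <- seq(-10, 10, by = 0.01)

dy.dx.draws <- matrix(NA, nrow = nrow(beta.draws), ncol = length(z.fits))
for (j in 1:length(z.fits)) {
  dy.dx.draws[, j] <- beta.draws[, "x"] + beta.draws[, "x:z"] * z.fits[j]
}

# 2.5% and 97.5% quantiles of dy/dx at each value of z (one column per z value)
ci.bounds <- apply(dy.dx.draws, 2, quantile, probs = c(0.025, 0.975))
```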
And so for each one of those columns, what I'm going to do is figure out the 0.025 and 0.975 confidence interval locations for that set of draws. And so I use the apply command to get those numbers. And then I plot that relationship. And it's kind of jumpy. And the reason it's kind of jumpy is because I'm looking at the 2.5% and 97.5% quantile for each set of draws, but I've done this calculation for 2,000 different values. Or hold on, let me think about this for a second. So I've done this: in the rows, I have the 2,001 values of z.fits, and in the columns, I have the 1,000 draws from the asymptotic beta distribution. And the reason the line is so jumpy is because, in the simulation process, I'm repeating the quantile calculations for 2,001 different values of z.fits. And depending on the exact value of z.fits, I might get a slightly different 2.5% and 97.5% pick. They're going to be very close to each other, but not exactly in a straight line. And so what I can do is use lowess smoothing, which I already showed you in a previous lecture, to draw a smooth line through those jumpy values to get a smoother, nicer line. And that's what I just did. And what you can see, and actually, I'm going to even add the legend here, is that the simulation value of dy/dx is very, very close to the analytical calculation, maybe slightly wider. And that is as it should be, in the sense that we expect these answers in general to be very similar to each other. Because even though they're different methods of getting the answer, ultimately they're trying to get the same answer. Some are more or less dependent on certain assumptions about the underlying data generating process than others, but the answers should be quite similar. So this is good news. As I began to mention earlier, we could also use bootstrapping in order to get these answers. That's a perfectly reasonable way to approach the problem. But that said, the answer would pretty much be the same. And I believe in one of your homework questions, I ask you to repeat this process using bootstrapping. So you're going to do this yourself and figure out, hey, how do I do this same exact thing that I just did two different ways, but with bootstrapping? And the good thing, the hopeful outcome, is that you get a very similar plot to the one that I just got. So I want to wrap up this week's lecture talking a little bit about the inclusion of quadratic terms in a linear model. So consider, for example, the following model, something that looks a little like this: y equals beta 0, plus beta 1 times x, plus beta 2 times x squared, plus some kind of randomly distributed error term. And as you will observe, the marginal effect of x on y, in this case, is beta 1 plus 2 times beta 2 times x. So that's right there. The reason we might consider this kind of model is because we expect there to be a curvilinear relationship between y and x, but we still want to operate inside of the linear model context. And the idea is, not every form of nonlinearity, well, actually, technically speaking, because of Taylor's theorem, essentially any smooth form of nonlinearity can be approximated with a polynomial in a linear model. But especially in these simple cases, where we think the relationship between y and x might look a little something like this, or this, we don't want to necessarily have to go to some really weird exotic model to make that happen.
And as you probably recall from algebra 2, a model with an x and an x squared gives you a parabolic or curvilinear relationship. And we can exploit that in the regression context in order to model curvilinear data without exiting the OLS framework. So here's an example. I'm going to generate a fake data set with 100 observations. Actually, I should probably, as usual, remove anything that's in the workspace already and set some kind of common seed so that we all get the same answers. And so I'm going to get 100 observations. I'm going to draw x between 0 and 10, and I'm going to construct x squared. And then I'm going to generate y out of that. And here's what the relationship between x and y looks like. It's a noisy parabola, a noisy curvilinear relationship. And this is the kind of thing that you might find in real-life data sets all the time, particularly if you're looking at a lot of international relations type data. This kind of relationship is extraordinarily common. And you can detect it with simple scatter plots. So if you find yourself in this situation, what you might want to do is say, all right, well, how do I handle this? Well, I want to fit a parabolic curve to this data set. And I can do that really easily with linear regression. But then, as before, I want to plot the marginal relationship between the two things and see whether it's statistically significant. And this is something you're going to do in your homework. So I'm leaving it a little bit open. But I want to give you a few hints about how you might do this and let you think about it. A squared term is, in effect, and actually not just in effect but in fact, an interaction between a variable and itself. So when we're thinking about testing hypotheses about the relationship between y and x, and we expect some kind of curvilinear relationship, and we're interested in seeing whether that relationship is statistically significant, what we might want to do is produce, using the techniques I've already shown you, some kind of dy/dx plot over different values of x. And that dy/dx plot, for a downward-opening parabola, is probably going to look a little something like this. And there are going to be some kind of standard errors around that. And so what we would be able to say is that increases in x are associated with decreases in y for some spans of x and increases in y for other spans of x. And you can generate that kind of plot using the techniques that we've already explored, for the case of squared terms, and actually for higher polynomial power terms as well. So that's something you're going to look at in your homework. So I don't want to give you the answer. But I think it's productive to think of a squared term in a linear model as being analogous to an interaction term, and then treating the analysis of that squared term in a similar fashion. All right, so that's plenty for this week. I hope you enjoyed the lecture. And I will see you in class, and see you in next week's video lecture.