Hi, I'm Dr. Justin Essary, and this is Week 5 of PolySci 509, the Linear Model. Today we're going to talk about hypothesis testing in the linear model. And I want to kick off our discussion by talking about the relationship between hypothesis testing in the linear model framework and the larger project that social scientists are engaged in, which is to say the creation and refinement of theories in light of evidence about their usefulness as predictive and explanatory tools. So what I'm going to do is start off by drawing a somewhat simplistic, but nevertheless useful, diagram of the scientific process. Somehow we come up with a theory. Now, I say somehow because theory generation is really not the primary topic of the course, nor have we spent a lot of time talking about how one comes up with a theory. But I presume that you've taken maybe some classes in game theory, maybe some classes in research design, and certainly a large number of substantive classes that are giving you ideas about previous theories that have been developed by other scholars and giving you the tools to develop your own. So I'm just going to black-box that and say somehow you come up with a theory. And what that theory does is generate some predictions. So the theory generates a set of predictions, which are just expectations about relationships that, if our theory is a good guide to explanation and prediction of the social world, are things we should expect to see when we check the empirical veracity of our theories. Then we take these predictions, and I'm going to draw a little box around that, and compare them to empirical evidence that we gather in some fashion. The idea here is that we want to confront our predictions with some kind of empirical evidence, maybe quantitative, maybe qualitative, that will give us an indication of just how useful the predictions are and, in turn, how useful the theory that generated those predictions is. And then finally, once we've done that comparison process, we want to take that result and revise our theory in light of what we found in the comparison of predictions with events. So that's a simplistic blackboard model of the scientific process. And what I want to do is say, well, I can cut this process in two. So I'm going to draw a line down the side here. On one side of this line, we have theorizing. There are lots of ways that that can happen. For example, it can happen game-theoretically, it can happen informally, it can happen via other theory-generation processes. What we're more concerned with presently is what's on the other side of this line. And there are lots of ways to confront theoretical predictions with evidence. But of course, given the nature of the material, our primary objective is to examine this through the eyes of statistics, to ask how we can use statistical tools to give us insight about how theoretical predictions match up with events. Now, one thing I want to say is that up to this point in the class, we've really been more concerned with studying events. What do I mean by that? Well, there have been some recent publications on this point, a paper and a book, by Clark and Primo. The article is the 2007 Perspectives article, and they also have a very recently published 2011 book that makes a similar point.
We don't really see events as they are in their totality and as a reflection of the truth. What we see is a model of events. In fact, there are multiple dimensions in which we don't see the truth, but only a model of the truth. One reason for that is that all of our observation is mediated by theory. When we see an event, when we interpret it, when we structure it, when we derive understanding from it, we are necessarily taking that data and filtering it through the lens of interpretation, which is to say through the lens of theoretical ideas that we've already built up over time. There's no such thing as seeing an event as it is. We see it through the lens of what we know about the social world. For example, take a somewhat trivial case: when you see a political event like a shooting, like that really famous picture from the Vietnam War of a South Vietnamese officer executing a prisoner at gunpoint, certainly there are things about that bit of data that are true just by looking at it, but most of our interpretation of that data comes through our understanding of the context. Just trivially, we have to know where the people are in the photo, we have to know where they come from, we have to know their background, something about the historical context. Breaking it down to the simplest things, we have to know what's going on there, what an execution means, the fact that that is an execution. So just that one photograph is layered with meaning and interpretation. Now, trying to interpret a large-N data set necessarily involves many layers of interpretation, starting with the data-gathering process and the coding process. There's a long chain of events that yields data, and it's not just looking at the world as it is. But even more to the point, when we talk about running a statistical model, in this class OLS, the linear model, we don't look at the relationships in the data just as they are. We build a model that helps simplify those relationships, adding some assumptions to the data set in order to make the relationships in that data comprehensible to us. As I mentioned in the first and second classes, a statistical analysis, or any kind of empirical analysis, is a product of the information contained in the data and the assumptions that are added to that data to knit it together, to make it easier to understand, to make it more comprehensible, and to make our results more certain and more efficient. We're not looking at events as they are; we're looking at a theory-mediated model of events. That's why what we've been talking about up to this point really is about structuring events. We've been talking about building models using data in order to provide a simplified and hopefully useful picture of the relationships that exist in the data. Now what we want to do is take that picture and compare it to the predictions that we get out of our theory. We want to check whether our empirical model gives us a picture of the world that's predicted by, or anticipated by, our theory in some way. I want to continue on this theme just for a second and follow up on another point that comes out of the Clark and Primo work.
When we think about comparing the predictions of models to the observations that we see in data, we don't necessarily mean that a better model is one that is more universally predictive or one that's perfectly predictive. There are lots of ways of approaching this, but one interesting way is through the metaphor of a map, like a subway map. One may analogize a theoretical model to a subway map in the sense that it is either useful or not, and its usefulness is defined in relationship to some specific purpose. Typically that purpose is not to provide super-accurate point predictions on all the dimensions of a particular social phenomenon. The subway map does not provide super-accurate predictions on all dimensions of the geographical area that it covers. For example, if any of you have ever looked at a New York subway map, it doesn't contain very good topographical information about New York. It does not provide good information about the relative distances between subway stops; in fact, the subway stops are more or less evenly spaced on the map, and in real life they're not. The curves of the subway lines don't exactly match the real curves of the subway system. In fact, even the things that the subway map is supposed to predict are not perfectly predicted by the subway map. For example, one thing a subway map is good for is figuring out where you need to get on and off the subway in order to get from point A to point B, where you can transfer from one line to another, and when certain trains are arriving, like when you need to be at the stop in order to get on. But all of those predictions can be wrong from time to time. A map may say a train is arriving at 10:20, but it's running late a certain day because of a mechanical failure; or a map may say, oh, I can get on at such-and-such a stop, but that station is closed for maintenance or due to some unforeseen event. None of these shortcomings of a subway map make the map a bad tool. It's an incredibly useful tool, a vital tool for navigating the subway system and getting around a large city. The analogy is that a theoretical model should be similarly regarded as a tool, evaluated in light of its usefulness for some purpose. And that purpose is typically not to provide dead-on point predictions for some event, but rather to provide a framework that allows us to make sense of that event and explain it. Now, I believe that even that less demanding, or charitable, I guess, is the right word, view of theoretical models and their relationship to data still necessitates some degree of accurate prediction. What I mean by that is, to return to the analogy of the subway map, a subway map does need to make certain predictions well in order to serve its purpose. It needs, on average and in general, to accurately predict arrival and departure times, and to accurately depict which stations connect to which, which lines connect to which, where stations are located, and so on. If it doesn't predict those things, it's not going to be useful for its ultimate purpose of getting around.
And a theory is not going to be useful for understanding and explaining social events if it can't, in some way, provide replicable or reliable predictions of what we expect to happen in the future, particularly in the case where events are not one-off events but are repeated. For example, if we want a model that helps us to understand certain kinds of wars, one thing it probably ought to be able to do is explain or predict when those wars are more or less likely to happen, with a reasonable degree of accuracy. So I wanted to bring up this Clark and Primo critique because I think it's important not to demand too much of theoretical models when confronting them with evidence. But nevertheless, I still think it's useful and interesting to think about confronting theories with evidence in light of their predictions, which is to say, to ask how accurately a theory's empirical predictions match what we see in the world. That's what we're going to be talking about today: how one uses a statistical model to test hypotheses such as those generated from a theoretical model. Now, at first blush, you might think that what we're doing here is really simple and easy. Of course I know how to test a model's predictions. Here's what I'm going to do. I'm going to have a theory, and that theory is going to give me some predictions that look something like this: as x changes positively, y will change positively. Or maybe something like: as x changes positively, I expect y to change negatively. Or maybe, if my theory is really, really good, I can even say: as x changes positively, y will change at the rate c. I know exactly the relationship. Every 1% increase in the interest rate will lower GDP by a quarter of a percent. That would be a really precise prediction. And so what I'm going to do is take those predictions, and then I'm going to estimate a linear model. And I'm going to say, all right, what I want to know is how much I predict y to change as x changes. Well, what I need to do to figure that out is to take the derivative of my model with respect to x. And if I estimate a model that looks like this, y hat equals beta hat 0 plus beta hat 1 x plus beta hat 2 z, say, well, I can figure out the derivative of that with respect to x: it's beta hat 1. So now what I'm going to do is just compare: does beta hat 1 look like dy/dx? That seems pretty easy. I have to point out the extreme lack of ambition instantiated or crystallized in this comparison. We are so lax that we are satisfied if we can get the direction of a change right. We are absolutely not shooting for the moon here. Suffice it to say, we are not expecting accuracy at the sixth decimal place. We're happy if we can get the plus or minus sign right. But even this relatively modest task, it turns out, is going to be pretty hard. Why is that? Well, let me give you an example. Suppose I estimate an empirical model and I find that d y-hat / dx equals beta hat 1, like I already told you, and let's say that equals c plus epsilon, where epsilon is a very small quantity, pretty close to 0. Now, if I find that, would I reject the hypothesis that beta 1 equals c?
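In symbols, here's the comparison we've just set up, a summary of what I said above, with c the theoretically predicted rate of change (the extra regressor z is my stand-in for whatever controls the model includes):

$$\text{theory: } \frac{dy}{dx} = c; \qquad \text{model: } \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x + \hat{\beta}_2 z \;\Rightarrow\; \frac{d\hat{y}}{dx} = \hat{\beta}_1; \qquad \text{finding: } \hat{\beta}_1 = c + \epsilon, \; \epsilon \approx 0.$$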
So if I have this theory, and my theory predicts, hey, every 1% increase in interest rates causes a quarter-percent drop in per capita GDP growth over a year, and then I run some empirical tests and I find that it actually causes a 0.26% drop in GDP growth per capita per year, would I say my theory is bad? I personally would say probably not. I'm actually going to say my theory is pretty darn good. So the answer I'm going to give to that is probably not, and I think there are two reasons why I would give that answer, and they're perfectly reasonable reasons. Reason the first: the difference between c and c plus epsilon is not important. It's not large and important enough to matter. What I'm assessing here is the relationship's substantive significance. This is an assessment of the substantive significance of the difference between c and c plus epsilon. So what I'm saying is: look, is there a difference? Sure. The difference between my prediction and my finding is epsilon, and epsilon doesn't matter to anyone. I'm as close as I need to be to make important, interesting predictions. So that's one reason. But that's not the only reason I could decide my evidence is supportive of my theoretical hypothesis. I might also say: look, our estimate of beta hat, beta 1 hat in this case, is imperfect and intrinsically random, and so, as a result, statistically there is no difference between c and c plus epsilon. This is an assessment of the statistical significance of the difference between c and c plus epsilon. That's getting at a different thing than the assessment of substantive significance. The substantive-significance assessment says: sure, the difference between c and c plus epsilon is there, but it's so small that it's not going to matter politically or scientifically, ergo I can ignore it. The statistical-significance assessment says: hey, I actually don't even know that there is a difference, because a difference of size epsilon could easily be attributable to randomness. Randomness, perhaps, in my sampling process: picking a lucky or unlucky sample gets me a little bit higher or lower than the true value of beta. It could even be that beta itself is random. There are all sorts of reasons why our model may be giving us a slightly random estimate in beta 1 hat, and so I can't tell the difference between c and c plus epsilon; for that reason, I'm going to say that they are the same thing. What's important for our purposes is to understand that there is a difference between these two concepts, and, as I'm going to talk about in a minute, most statistical hypothesis tests focus on the second kind of assessment rather than the first. So, as I just indicated, classical hypothesis testing, like the types of tests we're going to cover today, focuses on a result's statistical significance. Substantive significance is widely recognized among political scientists and statisticians to be important, but its assessment is much more informal, more a matter of scientific judgment: it involves a qualitative assessment of an effect's size and uncertainty in relation to its scientific importance.
Statistical significance, on the other hand, is about deciding whether a signal is strong enough that we can rule out the possibility of its being generated by noise; it's a more technical topic and, ergo, more amenable to formalized testing. So one might ask why we need a test of statistical significance: why can't we just eyeball it, maybe plot a confidence interval or something, and figure out, you know, that's different enough? Well, our estimate of d y-hat / dx, beta hat, is random because the error term u is random. And that's the actual u, not the estimated u; we're supposing an accurate specification of the OLS model, the linear model, here. So that error term is random, or possibly the true beta is itself random, although this possibility is not part of the classical exposition of hypothesis tests. Suffice it to say that typically hypothesis tests are presented through the framework of saying that u is random, which causes randomness in our estimate beta hat. It is also conceivable that the true value of beta is a random value, and the results we would get from assuming that are broadly similar to the ones I'm about to present to you, but that's usually not how it's presented. So randomness in u implies that our precise estimates of beta hat will depend on precisely what sample, and consequently what values of u, we are examining. Now, a word about the use of the term sample. One way of interpreting this framework is to say: there's a large population of people, and we're going to conduct a survey, so we sample a proportion of that population, and then we want to generalize from that sample's survey results to the opinion of the larger population out of which the sample was drawn. And because each sample is slightly different from every other sample, our results are slightly different every time, and hence we have variation in sampled opinion around the population mean. That is certainly one way to view what we're doing, and everything works if you view it that way. However, it's not the only setting in which we can think about sampling profitably. For example, in the international relations context, if we're looking at the international system between 2000 and 2010, it's not typically the case that we have a sample of that system. We can very often collect the entire comprehensive set of countries or interactions over that time period, and hence we're not, in that sense, sampling. On the other hand, if we think of the world as unfolding as the product of a random process, and by that I mean that the nature of both the independent and dependent variables is partially a product of irreducible randomness, then if we were to hypothetically rerun history, starting over with different initial conditions or slightly different values of the quantum spins of electrons, the world would unfold in a slightly different way, and the data set we would get from that world would be slightly different from the data set we get in the world we actually live through. So we can interpret even quasi-fixed or quasi-population data sets as intrinsically or implicitly coming out of a sample of possible worlds. And if we view it that way, the u's that were realized in that unfolding of the world were partially random, and if the world unfolded again, we would get different values of u.
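Here's a little toy simulation of that "possible worlds" idea; this is my own illustration, not code from the lecture files, and the seed and noise level are arbitrary. We hold X and the true beta fixed, redraw the error term u many times, and watch the estimated slope vary from world to world:

```r
# My own toy simulation (not from the lecture files) of the possible-worlds idea:
# fix X and the true beta, redraw u many times, and re-estimate the slope.
set.seed(42)                         # hypothetical seed
x <- runif(10, 0, 10)                # a small, fixed set of regressor values
beta.true <- 1.5

beta.hats <- replicate(1000, {
  u <- rnorm(10, 0, 15)              # a fresh realization of u for each "world"
  y <- beta.true * x + u
  coef(lm(y ~ x))["x"]               # the slope estimate in that world
})

hist(beta.hats)                      # estimates spread out around 1.5...
mean(beta.hats < 0)                  # ...and with this much noise, some are even negative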
So we don't necessarily require random sampling in the strict sense in order for all these results to apply the way we would like. Now, one question you might ask is: why do we even care that our estimate beta hat is random? Why is that a problem? Well, let's take a look at the distribution of an estimated beta hat as a function of a true beta. So here's a distribution, and I'm going to say, all right, here's what the true beta is. It's positive; this is the true beta right there, at about one and a half. And because of the randomness of u, if we were to take multiple samples out of a population generated by a data-generating process with this true beta, sometimes we would estimate beta hats that were really positive. We'll call that beta hat one. Sometimes we would estimate beta hats that were positive but smaller than the true beta; call that beta hat two. And from time to time, we would even estimate beta hats that were not only smaller than beta, but actually smaller than zero, in the wrong direction. In that case, we'd get a beta hat that's actually negative when the true relationship is positive. This is troubling because we typically only get to look at one realization of beta hat. That's particularly true in the cases I just mentioned, like the international system, where you can't just go out and get another sample of the world: the world has unfolded. We can wait for even more history to unfold, but we believe that causal processes change over time due to changes in the international system, so that's not going to help you much. Maybe you can measure things a different way, or measure a different aspect of the predictions, to get a second empirical test. But by and large, you're going to be pretty much limited to the sample you have. And that's troubling because, well, if I get a positive beta hat, that could be consistent with a true world where beta is negative. I could get a result that says X and Y are positively related when they are in fact negatively related. I could get a result that says X and Y are really strongly positively related even when they're just a little related, or have a near-zero relationship. That is troubling because it means that, in a sense, we can't trust our statistical results the way we might wish to, because of the problem of sampling. So what statistics is about, in some sense, is finding a way to deal with that. That's why it's problematic. So if you believe it's going to be a problem, and I hope you do, I can go a little bit further and talk about the kinds of problems we might expect to face. I'm going to start off by presuming that we have a theory that's telling us: I expect to see a positive relationship between two variables X and Y. All right, so there we are. I have that prediction. Now, there are two possibilities, one where the state of the world actually matches our prediction, and one where the state of the world does not match our prediction. In other words, in this sort of binary hypothesis-testing framework, our theory's prediction can be right or wrong, an accurate or inaccurate reflection of the nature of the empirical world. Now, when we confront this prediction with data, we can find two things.
We can find an estimated beta hat greater than zero, and we can find an estimated beta hat less than zero. So there are two cases that are kind of nice, but not especially interesting, because they're not problematic. Let me clarify what we're asking first: can the evidence be interpreted as consistent with our theory? In the first case, when we look out in the world, we find evidence in this estimation that is consistent with our theory, and that's actually the true state of the world: the true state of the world is that our theory is good and useful, a useful predictor, or perhaps a useful explanatory framework. This is a case where we have a true positive. The fact that it's a positive comes from the fact that we are confirming our theoretical hypothesis. Then there's the case where our theory is wrong and our empirical investigation reveals evidence that is inconsistent with our prediction. This is the true negative. But the two middle cases are the interesting ones for our purposes. The first one is the case where the world is actually consistent with our hypothesis, but we find evidence that is inconsistent with our hypothesis. This is the false negative: if our test were to reveal the true state of the world, we would find evidence consistent with our theory, but we didn't, due to chance, so we have a false negative. This is also labeled a Type II error. Then there's the case where the world is not consistent with our theory, yet we find evidence that seems to be consistent with our theoretical predictions. This is the case of a false positive, also known as a Type I error. So these are the four things that can happen, logically, in a binary hypothesis-testing framework. Two of them are problematic, two of them are good. And what we want to ask ourselves is how we want to trade off the possibilities of good and bad outcomes; that's at the heart of statistical hypothesis testing via statistical significance. It turns out that statistical tests, generally speaking, are designed to minimize the chance of a false positive at the expense of an increased possibility of false negatives. Why is that? Well, there is a thinking behind it, and it's not unreasonable. The idea is: look, it's better to ignore a correct hypothesis, and the theory from which that hypothesis flows, and miss out on opportunities for increased knowledge and maybe better-refined policy, than to falsely accept a theory and as a result have misguided conclusions that cause destructive policy making and wasted research effort. And so the way this is typically presented is that social science has a conservative bias. If we're going to say something new, if we're going to advance a new theoretical hypothesis and say that it's been supported by evidence, I'd rather make a mistake and say that a hypothesis that ought to be supported is not supported, than make a mistake and say, yes, this theory is great, we should move forward with it, and waste a bunch of time, effort, and money doing additional research suggested by that finding, and even worse, maybe have policy makers do things as a result of this finding that are false.
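Just to keep the four cases straight, here they are arranged in the usual two-by-two layout; this is my summary of what we just walked through:

$$\begin{array}{l|cc}
 & \text{evidence consistent with theory} & \text{evidence inconsistent with theory} \\ \hline
\text{theory right} & \text{true positive} & \text{false negative (Type II error)} \\
\text{theory wrong} & \text{false positive (Type I error)} & \text{true negative}
\end{array}$$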
And so it's presented as being conservative in a good way, or in a non-ideological way, I might say, in a sort of scientific way. But that's not the only way to view the set of trade-offs, and I want to start talking about this with a really contrived example. Consider the problem of designing a pregnancy test. There are two kinds of mistakes in a pregnancy test, a false positive and a false negative. A false positive occurs when a test says that a woman is pregnant when she's not. What are the consequences of a false positive? Well, first of all, a woman might suffer some level of psychological distress if that pregnancy is unwanted, or might experience a letdown when, if she wants to be pregnant, she eventually finds out that she's not. So there's some psychological harm. But beyond that, what it's going to prompt, probably, is for the woman to stop drinking, to protect the fetus; maybe stop smoking, to protect the fetus; perhaps consider going to the doctor, getting a prenatal visit, taking some vitamins; and, as part of that, getting a more accurate blood pregnancy test that has a much lower probability of making any kind of mistake, in particular a false positive. On the other hand, consider the consequences of a false negative on such a test: the test says that a woman who is pregnant is not. A woman might continue drinking under the false belief that she's not pregnant, thus causing damage to the fetus. She might say, yeah, the test says no, throw it in the garbage, forget about it, and never get tested again until three, four, five months down the line, all during which she's not getting prenatal care, she's not taking vitamins, she's not making adjustments to live a healthier lifestyle, not preparing financially for a baby, whatever. And so in the case of designing pregnancy tests, probably false negatives are a lot more dangerous than false positives. So it's not necessarily the case that tolerating false negatives is always the right trade-off; sometimes false negatives are the more dangerous mistake. Coming back to the issue of a statistical test, I think it's probably safe to say that, generally, a false positive may be more dangerous to the advancement of science and to the advancement of policy than a false negative. But we have to remember that false negatives do present opportunity costs. What I mean by that is that we lose out on the knowledge and policy advances that would have been made as a result of the missed finding. And even more to the point, we need to think about exactly what level of false negatives we expect to see for a given level of false positives. In fact, we can show the trade-off between power, the ability to avoid false negatives, and size, the ability to avoid false positives, in an explicit analysis. And we're actually going to do so, not later in this lecture, but in the next lecture, when we talk about size-power trade-offs. So that's a bit of a teaser for next time. All right, let's talk about hypothesis testing. So what's going on when we test a hypothesis?
Well, what we're going to say is that there's some alternative hypothesis, which is just some range of an estimated coefficient like beta that we are going to interpret as being consistent with our theoretical prediction. So if our alternative is that beta is bigger than zero, probably our prediction is that there's a positive relationship between y and x. And then the null is simply whatever range of beta is left over that's inconsistent with our theoretical prediction; in this case, that beta is less than or equal to zero. So what we're going to do is only accept the alternative, which is a way of saying only conclude that our evidence is consistent with the theoretical prediction, if the probability that beta hat is greater than or equal to whatever we observe, under the null hypothesis, which in this case is that beta is zero, is less than alpha, where alpha is some critical value that gives the chance of a Type I error, a.k.a. a false positive. All right, let me talk a little bit about what's going on here, and I think it might be helpful to do that with reference to this handy little graph. Here is a graph that depicts a distribution, and as you can see, here's my null hypothesis. Now, one thing you might immediately notice is: wait a minute, up here you told me that my null hypothesis is that beta is less than or equal to zero, and now you're treating it like a point at zero. What did you just do? Well, I regret to report that this is a little incoherent, and ultimately the reason for it is that our modern method of testing hypotheses is a combination, a sort of shotgun marriage, of the Neyman-Pearson view of hypothesis testing and the Fisherian view, R. A. Fisher's view, of hypothesis testing. The Neyman-Pearson view tends to think in terms of ranges of coefficients or parameters that are consistent or inconsistent with a particular theoretical hypothesis, whereas the Fisherian view tends to think in terms of point hypotheses. And the product of this, in our modern method, is that even in the case where we have ranges of parameter values that are consistent or inconsistent with some hypothesis, we still end up saying: okay, suppose that the null hypothesis is true and that beta is actually equal to zero, which is a point. So just take that for what it is: the null is zero. Under the null, we will nevertheless often observe positive values of beta hat, and in particular, sometimes we're going to observe very large values of beta hat. But the shaded area here gives the probability, the integral under the probability density function, so a cumulative probability, that we observe a beta hat greater than some particular value right here. And so the standard modern hypothesis-testing paradigm says: put your estimated beta hat right here, this is the observed beta hat, and then determine the area under the probability density, under the null, to the right of your observed beta hat. This is for a one-tailed test; we'll talk a little later about two-tailed testing, if you've heard of that before. And if that probability is less than some critical value alpha, we will conclude: you know what, the null is just not very consistent with this evidence.
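Written out, the one-tailed decision rule I just described looks like this, where beta-hat-observed is the estimate we actually computed:

$$p \;=\; \Pr\!\left(\hat{\beta} \ge \hat{\beta}_{\mathrm{obs}} \mid H_0 : \beta = 0\right) \;=\; \int_{\hat{\beta}_{\mathrm{obs}}}^{\infty} f\!\left(\hat{\beta} \mid H_0\right) d\hat{\beta}; \qquad \text{reject } H_0 \text{ if } p < \alpha.$$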
This evidence is really more consistent with our alternative hypothesis. The intuitive logic here is that we're asking: if the null were true, how likely would we be to see the evidence that we see, as instantiated or reified in a particular beta hat coefficient? If we see a beta hat way, way out here, just taking this graph again as an example, could that have occurred under the null? Yes. How likely would a beta hat of that size be to appear under the null? Not very likely; the shaded area there might be 1%. So, if we see that, and our theoretical hypothesis is that the relationship should be positive, what we're going to say is: you know what, this finding is simply not that consistent with the null, and we can safely reject it. Now, I mentioned that we compare this to an alpha value. Conventionally, alpha is 0.05, which means that if the shaded area in that graph is less than 5%, then we go ahead and accept the alternative hypothesis and conclude that our evidence is more consistent, one might say, with our theoretical prediction than with the null. So, just to recap: here's the distribution of beta hat under the null. We put our observed value of beta hat here. We shade the area under the PDF to the right of that. If that area is less than a predetermined number alpha, then reject the null and conclude that this evidence is more consistent with our alternative hypothesis than with the null. Alpha is the probability of a false positive. As you may have inferred from the discussion, alpha is the proportion of the time that we would falsely reject the null, that we would come up with a false positive, in the world where beta really was 0. The probability of a false negative is never assessed at all. And you might not expect this, but it's actually not a constant function of alpha: it depends on the estimator, the size of the data set, and a variety of other factors. And we never talk about it in the hypothesis-testing framework; it's simply not considered. So the probability of a false negative is never assessed in this framework, and that is problematic. We're going to return to that in the next lecture, when we talk about size-power curves. And I think it's safe to say that even if one accepts this hypothesis-testing framework, it sure would be nice to know, given this framework, how many false negatives we would get with a particular testing procedure. Now, one thing you might be asking yourself right now is: wait a minute, in that previous graph, how is it that he knew that the distribution of beta hat looked like a normal distribution? Well, I actually haven't told you anything yet that would lead you to that conclusion. You don't know that. But I'm going to tell you now. It turns out that knowing the distribution of beta hat under the null is actually somewhat challenging. But there's an alternative: we can write down a statistic for which we do know the distribution. And that statistic is the z statistic. Here it is: z equals beta hat minus beta zero.
Actually, I'm going to rewrite that a little bit: instead of beta zero, I'm going to write it as beta under the null, all divided by the square root of the variance of beta hat. This lower part is sometimes written as the standard error of beta hat, and the standard error of beta hat is the square root of its variance, so that means the same thing. So beta hat is the estimated coefficient, beta zero is the value of the coefficient under the null, and the denominator involves the variance of beta hat. You might recall from a previous lecture that the variance of beta hat is equal, under the CLRM assumptions, to sigma squared times X transpose X inverse, the variance-covariance matrix of the OLS regression estimates, under the assumption of homoskedasticity. So, when calculating the z statistic for a particular coefficient beta i, you should take the ith diagonal element of the VCV: the variance of beta i hat is the ith diagonal element of the VCV. One thing you might notice here is that the z statistic is a standardization of the distance between the observed value of beta and the null prediction for beta. If you've standardized variables before, and I'm sure most of you have, you take a variable X, subtract its mean, and then divide by its standard deviation, and this process gives the standardized X a mean of zero and a standard deviation of one. Well, if we go back up to the z statistic, that's precisely what we've done here. We've taken beta hat, subtracted its mean under the null hypothesis, which is typically zero, and then divided by its standard error. And so, consequently, we expect z to have a mean of zero and a standard deviation of one, just like any standardized variable. The z statistic formula is the same idea. So now, we can actually prove that we know the distribution of z for any sample size, and I'm going to prove that to you right now. I'll start by writing down the z formula: z is beta hat minus beta zero, divided by the square root of the variance of beta hat. Now, for simplicity of exposition, I'm going to presume a regression with a column vector for X; in other words, X is n by 1. I could do this exposition for an n by k matrix X, and everything would come out the same, the notation would just get more annoying, and you'll see where in just a moment. So, without loss of generality, I'm assuming an n by 1 column vector for X. Now I'm going to make some substitutions. The first thing I'm going to do is substitute something for the variance of beta hat: in particular, sigma squared times X transpose X inverse, all to the negative one-half power, or in other words, one over the square root of that. That's just the formula for the variance of beta hat. Then I'm going to substitute for beta hat its matrix OLS formula, X transpose X inverse X transpose Y, so the whole thing is that square root term multiplied by the quantity X transpose X inverse X transpose Y minus beta zero. All perfectly reasonable substitutions. Now I'm going to start doing algebra and simplifying.
So, sigma squared to the negative one-half power is sigma to the negative one, and X transpose X inverse to the negative one-half power is X transpose X to the one-half. For the rest of it, X transpose X inverse X transpose, I'm going to substitute something for Y: I'm going to substitute in X beta zero plus u. What I've done here is invoke assumption one from the CLRM assumptions: if this model is appropriately specified, then y equals X beta plus u, and under the null, beta is beta zero. And this is where I'm really invoking that thing I mentioned earlier, assuming X is n by 1; it just makes this part easier to write down. Okay, let's start simplifying. I've got z equals sigma to the negative one, times X transpose X to the one-half, times the quantity X transpose X inverse X transpose X beta zero, plus X transpose X inverse X transpose u, minus beta zero. X transpose X inverse times X transpose X simplifies to one, so the first term inside the brackets is just beta zero. Then I've got a beta zero minus a beta zero, so I can kill those too. What am I left with? z equals sigma to the negative one, X transpose X to the one-half, X transpose X to the negative one, X transpose u. And a thing to the one-half power times a thing to the negative one power is a thing to the negative one-half power, so z equals sigma to the negative one, X transpose X to the negative one-half, X transpose u. Okay, so that's what I've got, which is great. The question is, what does that teach me? Well, here's what I've got. I've got sigma to the negative one, or one over sigma. For the time being, I'm going to consider that a constant; under the homoskedasticity assumption there is a constant variance, so that's not absurd. I'm going to assume that all the X components are non-stochastic, which is part of the CLRM assumptions, so that's fine. That only leaves u, and u is definitely random; as an error term, it is intrinsically and irreducibly random. If I want to know something about the distribution of z, because only one thing in z is random, and because for a constant a and a random variable X, the distribution of aX is just a scaled version of the distribution of X, I need to make an assumption about the distribution of u. I'm going to assume that u is normally distributed with mean zero and variance sigma squared. If this assumption is true, z is distributed normally with mean zero and standard deviation of one, for the same reason that any standardized variable has a mean of zero and a standard deviation of one. When we make this assumption, we get the classical normal linear regression model. So up to this point, the assumptions we've invoked have really made no reference to the distribution of the error term at all.
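To collect what just happened on the board in one place, here is the same derivation in clean notation, for the n-by-1 X case narrated above:

$$z = \frac{\hat{\beta} - \beta_0}{\sqrt{\operatorname{Var}(\hat{\beta})}}
= \left(\sigma^2 (X'X)^{-1}\right)^{-1/2} \left[ (X'X)^{-1}X'Y - \beta_0 \right]$$

$$= \sigma^{-1}(X'X)^{1/2}\left[(X'X)^{-1}X'(X\beta_0 + u) - \beta_0\right]
= \sigma^{-1}(X'X)^{1/2}\left[\beta_0 + (X'X)^{-1}X'u - \beta_0\right]$$

$$= \sigma^{-1}(X'X)^{-1/2}X'u \;\sim\; N(0, 1) \quad \text{when } u \sim N(0, \sigma^2 I).$$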
But now we needed to make an assumption in order to be able to figure out what the distribution of z would be in small samples, namely, that it's normal. So we had to add an assumption about the normality of the error term, and adding that assumption gets us into the CLNRM world. What's important to understand here is that, at least in small samples (there are large-sample results that don't rely on this assumption, but for the time being, in small samples), the z-statistic, and all the tests that flow from similar ideas, depend on the CLNRM assumptions. Hence, hypothesis tests can be misleading if those assumptions do not hold. And you may have already noticed, and if you haven't, you soon will, that in applied papers and projects there's often a concern with exactly how a particular error term is distributed. Is it correlated with the regressors? Is it normal? And that's really important, because if those assumptions are violated, we might have a problem testing hypotheses out of that model, because, as I've just shown, the hypothesis-testing framework depends on those assumptions. You might also have noticed that a lot of the results we've invoked up to this point, for example, the unbiasedness of beta hat, or the fact that OLS is the best linear estimator of the conditional mean, none of that crap relies at all on the distribution of u, or in some cases even the existence of u. But the more we want to say, the more assumptions we have to make, and the more sensitive our results become. So if we want to do hypothesis testing, unfortunately, we're going to get into a world where the assumptions we make become more numerous, and violating those assumptions causes us problems, particularly with respect to the new results we get out of them. That's why hypothesis testing, or the efficiency, so to speak, of an estimator, is often one of the first things that gets assessed or criticized: it's a canary in the coal mine with respect to violations of the assumptions. If assumptions get violated, it's quite common for the standard error of some estimate, and consequently the hypothesis tests that come out of that estimate, to be too big, too small, or otherwise bad. So, there you have it. Now, there's a little problem with the z-statistic, and it's subtle enough that it may have slipped under your radar, but it's there. Remember, when I wrote down the z formula, it's sigma to the negative one, one over sigma, times X transpose X to the negative one-half, times X transpose u. I don't actually know what sigma is. I didn't write sigma hat; I wrote sigma. And if I don't know what sigma is, I can't necessarily calculate z. But I can estimate sigma. I can say: my estimate of sigma is going to be given by the square root of sigma hat squared. And as you may recall, our estimate of sigma squared is one over n minus k, times u hat transpose u hat, where n is the number of observations, k is the rank of X, or the number of variables in X, including the constant, and u hat transpose u hat is the sum of estimated squared errors. Normally, when we estimate a variance, we divide by n, the number of observations, but here we divide by a smaller number, because u hat systematically underestimates u, as a consequence of the fact that OLS minimizes the estimated errors.
And so as a result, we need to inflate our estimate a little bit, and that's what the n minus k term does. If we substitute sigma hat squared into the formula for z, what we get is something new: the t statistic. It's X transpose X to the negative one-half, times X transpose u, times the quantity one over n minus k, u hat transpose u hat, all to the negative one-half power, because we want sigma, not sigma squared, so we need the square root. That's the t statistic. And it too depends on the classical linear normal regression model, with emphasis on the normal. The t statistic actually has a distribution that's just a little bit different from the z, or normal, distribution, although they're quite closely related. There are some distributional theorems from probability theory here that I'm simply going to cite and move on without belaboring; feel free to take a probability class, or read more carefully through Davidson and MacKinnon, if you'd like to see these things proven. It turns out that any quantity X divided by the square root of Y over m, where X is distributed unit normal, which means mean zero and standard deviation of one, and Y is distributed chi-squared with m degrees of freedom, and X and Y are independent, takes a t distribution with degrees of freedom m. Okay, what the heck did I just say? All right, take a look at this t formula that I wrote out for you. As I've already shown you, the red part is distributed normally with mean zero and standard deviation of one. That is the normalized z-statistic bit, right? It's the normalization: you're demeaning and standardizing. The blue quantity, u hat transpose u hat, is distributed chi-squared, because any quantity of the form v transpose v, where v is an m by 1 vector whose elements are independent normals with mean zero and standard deviation one, is distributed chi-squared with m degrees of freedom. So here we have the normal part, the N in C-L-N-R-M, coming home to roost again. We need it here too, because we need the errors to be normally distributed in order for the sum of squared errors to be chi-squared distributed, which in turn allows us to say that the t is t-distributed. Okay, so what is the chi-squared distribution? You've probably seen it before, and there are lots of contexts in which it arises. I'm not going to cover all those contexts, but you've probably heard of at least one case, the chi-squared test, whose statistic is distributed according to the chi-squared. I just want to show you real quick a few chi-squared distributions that are in your lecture file for this week, here in the lower left-hand corner. There are three of them. The first one is for degrees of freedom, m, of two; the dashed line is for degrees of freedom of three; and the dotted line is for degrees of freedom of seven. Remember, m, the degrees of freedom, is equal to n minus k, the number of observations minus the rank of X. What you'll see here is that as the number of observations grows relative to the number of variables, we expect a distribution of the sum of squares with a larger mean, which kind of makes sense.
The more observations you have, even if those observations are mean zero, the fact that there's noise, and the fact that you're squaring the distance from the noisy parameter to the mean, is going to get you bigger and bigger sums as you go. Well, that's what a chi-squared distribution looks like. What we learn from that is, first, that the sum of squares of the estimated errors is distributed chi-squared, and that enables us to say that the t statistic is distributed according to the t distribution. So t also has a degrees-of-freedom argument equal to n minus k, and I plotted some t curves for you here, starting with a degrees of freedom of one and then getting bigger, up to a degrees of freedom of 30. And what I hope you will observe here is that the t looks a lot like the normal distribution. That's not a mistake: the t is very closely related to the normal distribution, but with fatter tails. Why are the tails fatter? Well, consider any particular observed beta hat. It should be easier to pass a hypothesis test with more evidence, because the more evidence you have, the larger your sample size, the less variation we expect to see around the null, around the true state of the world, wherever that may be. And then observing a large positive deviation, or for that matter a large negative deviation, from the null hypothesis of zero becomes a more and more certain indicator that that null hypothesis is probably not right. In other words, the evidence becomes less and less consistent with the null hypothesis the more evidence you have, for a fixed quantity beta hat. So you can see that the p-value, which is just the integral of this pdf to the right of our observed value, gets smaller and smaller as the number of degrees of freedom rises under the t distribution. In fact, as n, the number of observations, approaches infinity, the t distribution looks more and more like a normal distribution with mean zero and standard deviation of one, which implies, not entirely surprisingly, that in a very large sample, t-tests and z-tests are the same. There is no difference. And ultimately, that relates back to the fact that the larger the sample is, the more accurate our estimate of sigma is, and so the closer the t-test comes to the z-test. So all this work has been toward the end of getting us to this point, the point at which we can establish a standardized procedure for testing hypotheses. What all of what we talked about boils down to is the following. To conduct a test of statistical significance for a particular beta i coefficient, if you're willing to accept the classical linear normal regression model assumptions in a small sample, you calculate the difference between the observed and null values of beta i hat, divide by the estimated standard error of beta i hat, and get the t statistic. Use that t statistic to calculate the area under the t curve to its right; that area is the p-value. Compare that p-value to alpha. If p is less than alpha, reject the null. If p is greater than alpha, fail to reject the null. That's the standard procedure. Equivalently, we can compare the t-statistic itself to a critical value.
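In code, the recipe is only a few lines. Here's a minimal sketch, with made-up numbers standing in for an estimated coefficient and its standard error; none of this is the lecture file's code:

```r
# A minimal sketch of the testing recipe; beta.hat, se.hat, n, and k are
# hypothetical numbers for illustration, not results from the lecture.
beta.hat <- 2.02   # hypothetical estimate
se.hat   <- 0.05   # hypothetical standard error
n <- 100; k <- 2   # observations and rank of X

t.stat <- (beta.hat - 0) / se.hat                      # null value of beta is 0
p.one.tailed <- 1 - pt(t.stat, df = n - k)             # area under the t curve to the right
p.two.tailed <- 2 * (1 - pt(abs(t.stat), df = n - k))  # both tails, for |t|
p.one.tailed < 0.05                                    # TRUE means reject the null at alpha = 0.05
```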
If alpha is 0.05, then the critical t-statistic associated with that alpha is 1.645, which means that we need to see a t come out of our observed beta bigger than or equal to 1.645 in order to reject the null. If alpha is 0.025, then the critical t is 1.96. Now, why would we cut alpha in half? Well, the logic for doing this, which is called two-tailed t-testing, is that, coming back to the distribution of beta hat here, what I said before is, hey, what we want to do is take our observation of beta and figure out how likely we would be to see a beta hat of this size or larger in the case where the null was true. But what we might do instead is ask how often we would see a beta hat of this absolute size, in either direction, in the case where the null is true, which is just a way of saying: how likely are deviations of this absolute size under the null? Practically, what this boils down to is that you cut the one-tailed alpha in half and put 2.5% in each tail. That's a two-tailed test. So if the p-value is less than alpha, or if the t-value for the particular beta coefficient you've estimated is bigger than the critical t, then your beta coefficient is statistically significant. As n goes to infinity, t-testing is equivalent to z-testing. So Stata and R always report t-statistics, and when you have a large sample, they just work out to be z-statistics. So now what I want to do is a couple of quick examples of how one does this on a day-to-day basis in R or Stata, two of the most popular programs among political scientists. In your lecture file, I've got a couple of simple hypotheses that you can test. What I've done here is create a fake data set. I'm going to clear out the memory and use the foreign library, which I have installed. I'm going to set the random seed so that we all have the same random numbers, and I'm going to draw a few variables out of a data-generating process. First, I'm just going to take 100 x's from 0 to 10 over the uniform distribution. Then I'm going to say that y is 2x plus 1, plus a normally distributed error term with mean zero and standard deviation one and a half. I'm also going to draw two irrelevant variables, w and z; I just draw 100 of each from the uniform distribution from 0 to 10. Their usefulness will become apparent later. Then I'm going to bind all of these into a data frame, save it as lecture5data.dta, clear everything out, read that data in again, and attach it. So that's where you would start from if you were doing this from scratch. Now, if I run a linear model and calculate the coefficients and standard errors of that model as shown here, what I get is: here's my estimate of the constant, here's my estimate of the coefficient on x, here are the standard errors of those two things, here are the t-statistics for those two things, calculated in the way that I just outlined, and in the last column, here are the p-values associated with these two coefficients. What this is telling us is that we would see a constant of size 0.777 only about 0.6% of the time if the true intercept were 0. And if the true x coefficient were 0, we would see a coefficient of size 2.019 very, very infrequently: about 2 times 10 to the negative 16th of the time, which is to say basically never.
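Here's a sketch of the data-generating process just narrated. The seed and object names are my own stand-ins rather than the lecture file's exact code, so the estimates will differ slightly from the ones quoted above, and I've skipped the save-and-reload dance for brevity:

```r
# A minimal sketch of the narrated data-generating process; seed is hypothetical.
set.seed(1)
x <- runif(100, 0, 10)               # 100 draws from Uniform(0, 10)
y <- 1 + 2 * x + rnorm(100, 0, 1.5)  # y = 2x + 1 plus N(0, 1.5^2) noise
w <- runif(100, 0, 10)               # irrelevant regressor
z <- runif(100, 0, 10)               # irrelevant regressor

fit <- lm(y ~ x)
summary(fit)$coefficients            # estimates, SEs, t-statistics, two-tailed p-values
```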
Now, if I run a linear model and calculate the coefficients and standard errors of that model as shown here, what I get is: here's my estimate of the constant, here's my estimate of the coefficient on x, here are the standard errors of those two things, and here are the t-statistics for those two things, calculated in the way I just outlined. In the last column are the p-values associated with these two coefficients. What this is telling us is that we would see an intercept of size 0.777 only about 0.6% of the time if the true intercept were 0, and that if the true x coefficient were 0, we would see a coefficient of size 2.019 very, very infrequently: about 2 times 10 to the negative 16th proportion of the time. So basically never. Remember that these are two-tailed p-values, so the probabilities being assessed concern the absolute magnitude of the coefficients, not deviations in one direction. If we wanted a one-directional, or one-tailed, significance test, we would just cut those p-values in half and compare them to whatever alpha we like; 0.05 would be an obvious choice. I can also load this data into Stata. Here's Stata 11. The easiest thing to do is probably just to double-click on the data file, which brings it up like so: here's the data file with variables y, w, x, and z, just as promised. And if I type "regress y x" I should get... hmm, that doesn't look right. All right, I had an earlier copy of that data file saved from an earlier run, so let's try that again. If I regress y on x, I get very, very similar coefficients to the ones I got in R: compare 2.01924 with 2.01924, and 0.77706 with 0.777057. Similar t-statistics, similar p-values; everything's pretty much the same, which is not surprising since it's the same method. So: similar procedures in either package, and similar results from either package. I'd like to wrap up today's lecture by talking about how one might test the statistical significance of multiple coefficients at a time. What if multiple beta values, or maybe all the beta coefficients simultaneously, are to be tested for statistical significance? Suppose there's a model with two blocks of variables, x1 and x2, where x1 is n by k1 and x2 is n by k2. My null hypothesis is that beta 2, which in this case is a k2 by 1 vector, is entirely zero: all of the elements of that vector are equal to zero. The alternative hypothesis is that at least one element of beta 2 is not equal to zero. What the F-test does is compare the residuals from two different regressions. Regression one fits y on both x1 and x2 and yields residuals v hat; regression two fits y on x1 alone and yields residuals u hat. The first is referred to as the unrestricted regression, and the second is the restricted regression, because we are restricting beta 2 to equal zero. In this framework, the F statistic is

F(r, n - k) = [ (û'û - v̂'v̂) / r ] / [ v̂'v̂ / (n - k) ]

where r is the number of restrictions, here k2, and n - k is the degrees of freedom of the unrestricted regression, with k = k1 + k2 the total number of coefficients. So that's the F statistic, and you'll notice it carries two degrees of freedom arguments: r, the number of restrictions, and n - k, the degrees of freedom of the unrestricted regression. This statistic depends on the classical linear normal regression model assumptions, because we need both u and v to be normally distributed in order for their sums of squares to be distributed chi-square.
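To make that formula concrete, here's a small sketch of computing the F statistic by hand in R, testing the restriction that the w and z coefficients are zero in the fake data from earlier; the variable names here are my own.

    library(foreign)
    lecture5data <- read.dta("lecture5data.dta")      # the fake data saved earlier

    # Unrestricted regression: y on x, w, and z (k = 4 coefficients, counting the constant)
    unrestricted <- lm(y ~ x + w + z, data = lecture5data)
    v_hat <- residuals(unrestricted)

    # Restricted regression: w and z coefficients restricted to zero (r = 2 restrictions)
    restricted <- lm(y ~ x, data = lecture5data)
    u_hat <- residuals(restricted)

    n <- nrow(lecture5data)
    k <- 4                                            # coefficients in the unrestricted model
    r <- 2                                            # number of restrictions

    F_stat <- ((sum(u_hat^2) - sum(v_hat^2)) / r) / (sum(v_hat^2) / (n - k))
    p_value <- 1 - pf(F_stat, df1 = r, df2 = n - k)   # one-tailed, as F tests always are

This should match what the packaged tests report for the same pair of models, which is a useful sanity check.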
Coming back to the distribution of that statistic: we have two chi-square quantities, each divided by its degrees of freedom, and the ratio of two such quantities is distributed F. Here's what the F distribution looks like for various values of the number of restrictions, and as you can see, the greater the number of restrictions, the more the F distribution shifts to the right. This F test is quite easy to execute in Stata and in R. Actually, Stata does it to a degree automatically: it automatically tests whether all the variables in your regression, except the constant, are simultaneously equal to zero. That's given right here: F(1, 98), so one restriction and n - k = 98. The p-value for this F statistic is essentially zero, which means the probability that we would see this F under the null hypothesis that all the coefficients except the constant are equal to zero is tiny. In other words, this coefficient is way too big to have occurred by chance. If we add in the irrelevant variables w and z, the F test still gives us a p-value of zero. Now we have three restrictions, with x, w, and z all restricted to equal zero, and we still get a p-value of zero on the F, meaning we can reject the null that all three are equal to zero. That's as it should be, because x's coefficient is not zero, and we know that because we made up this data. If we pull x out, though, ah, now we get a p-value for the F statistic of 0.75. So it's likely, in fact we know, that w and z are irrelevant: we're quite likely to see coefficients of these sizes in the case where the true coefficients are actually zero, and we know they are actually zero because we generated the data. I should mention that the F test is intrinsically a one-tailed test, much like the chi-square test. What we do is pick the F value from this distribution that puts an area of 5%, or whatever we choose alpha to be, in the right-hand tail, and then compare that critical F to the observed F. In this case, the observed F is 0.29, way to the left of the critical F for this particular case. R does F statistics too, so let me show you an example. Suppose I want to conduct an F test of multiple restrictions. I run a linear model with y as a function of w, x, and z, and the way to run an F test in R is to run it in the ANOVA framework: you fit two models, one that imposes the restrictions and one that doesn't. Model one is the restricted model, model two is the unrestricted model, and then we ask whether these restrictions are rejected by the F test. As you can see, here's the F statistic, 0.2, and here's the p-value for that statistic, 0.81. We cannot reject the null hypothesis that the restrictions of the w and z coefficients to zero are justified; remember, the null of an F test is that the restrictions are valid. A sketch of this two-model comparison appears after these closing remarks. All right, that's hypothesis testing for this week. In some ways we're going to continue with hypothesis testing for a long time, but this should get you started. I hope you enjoyed the lecture, and I'll talk to you soon.
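As promised, here's a minimal sketch of that two-model anova comparison, assuming the lecture5data data set generated earlier; the exact F and p-values you get will depend on your seed.

    library(foreign)
    lecture5data <- read.dta("lecture5data.dta")      # the fake data saved earlier

    model1 <- lm(y ~ x, data = lecture5data)          # restricted: w and z coefficients set to zero
    model2 <- lm(y ~ w + x + z, data = lecture5data)  # unrestricted: w and z enter freely
    anova(model1, model2)                             # F test of the two restrictions

A small F and a large p-value here mean the data are consistent with the restrictions, so we fail to reject the null that the restrictions are valid.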