Hi, this is Dr. Justin Esri. This is week four of PolySci 509, the linear model. Today, what we're going to talk about is what properties we can derive for, or infer about, ordinary least squares regression when we presume that OLS is an accurate reflection of the data generating process, whatever that may be. What am I referring to here? Well, up until this point, everything we've said about OLS, all the properties we've derived for OLS, have been fairly agnostic about whether OLS is actually a good or accurate reflection of the data generating process. Which is to say that all the properties we've talked about hold regardless of the underlying data generating process, and that is a very powerful thing, because it means that everything we've said is useful even if we know OLS is kind of wrong, but we're using it as an approximation to that underlying DGP. To give you a couple of examples of things that are true even when OLS is not a reflection of the underlying data generating process: first, we know that OLS will fit a line that minimizes the sum of squared estimated errors, u hat transpose u hat, which is just a technical way of saying that ordinary least squares regression will fit a line to the data that minimizes the degree of estimated error, no matter what the underlying data generating process is. Now, those errors could be big. The fitted line could do violence to our understanding of the data generating process; that's all true. Nevertheless, it'll be the best we can do with a straight line or a plane or a hyperplane. Number two, and this is sort of another way of saying what I said before, x beta hat from an ordinary least squares regression provides the best linear, error-minimizing approximation to the expected value of y given x. So the data generating process is some function that maps x into y, with an expectation, right? 
So if we have any particular value y zero, the DGP has some noise in it, but it also has a signal, and the signal component is what we would expect to see for that observation given an x value. This expectation of the data generating process may well not be linear. It's probably very often not gonna be linear. OLS will give us the best linear approximation to that process. Sometimes this is stated as: OLS is the best linear approximation to the conditional mean. So even when it's wrong, OLS is right, sort of, in a sense. But now we're gonna go further. Now we're gonna say: suppose OLS is right. And what I mean by that is, suppose that the world is a linear model, that the data generating process is a linear model, y equals x beta plus u. Maybe we don't know exactly what x beta should be, but we know it's something and we know that it's linear. In fact, the world is, in some sense, a linear model. If that's true, or mostly true, close to true, I should say, then OLS has some very attractive properties when applied to data generated out of that process. So here's how we're gonna proceed today. We're gonna talk about a couple of properties that are true of OLS in this world where the data generating process is linear. And the way we're gonna demonstrate those properties is we're gonna make some assumptions about the world. These assumptions are unproven; they're taken at face value. The results that we'll get are a combination of these assumptions and of the logical rules of mathematics. And these assumptions effectively codify our statement that we are going to proceed as though the world is some kind of linear data generating process and operate accordingly. And if we're right, all sorts of wonderful things will happen, and those wonderful things are the things I'm about to unfold. 
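To make the "best linear approximation" point concrete, here is a small simulation sketch in pure Python. Everything in it is made up for illustration: the DGP is deliberately nonlinear (the conditional mean is x squared), yet the OLS line still has a smaller sum of squared errors than any nearby line.

```python
import random

random.seed(0)
# A deliberately nonlinear DGP: E[y|x] = x^2, so a straight line is "wrong".
n = 500
xs = [random.uniform(0, 2) for _ in range(n)]
ys = [x ** 2 + random.gauss(0, 0.1) for x in xs]

# Closed-form simple-regression OLS: slope = cov(x, y) / var(x).
xbar = sum(xs) / n
ybar = sum(ys) / n
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
sxx = sum((x - xbar) ** 2 for x in xs)
b1 = sxy / sxx
b0 = ybar - b1 * xbar

def sse(a, b):
    """Sum of squared errors for the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Even with a nonlinear DGP, the OLS line beats nearby perturbed lines.
for da, db in [(0.1, 0), (-0.1, 0), (0, 0.1), (0, -0.1)]:
    assert sse(b0, b1) < sse(b0 + da, b1 + db)
```

The fitted line may badly mischaracterize the curvature of the DGP, but among all straight lines it is the error-minimizing one, which is exactly the claim above.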
Five of the assumptions are thought to be the most important because they are the minimal set of assumptions from which the best known results flow. Which is just to say, there are certain properties of OLS that often get cited and are thought to be especially important, and those properties derive from five assumptions that compose the classical linear regression model. So let's talk a little bit about what those five assumptions are and start proving results with them. All right, so here are the classical linear regression assumptions. The classical linear regression model is defined by these five assumptions. The first one is kind of the one that we've already stated, which is right here: y equals x beta plus u. This assumption can be thought of as a combination of three sub-assumptions, or minor assumptions. The first of these is that we have the correct specification of x. For many of these results to hold, we're gonna not only need to know that the world is linear, in other words that it follows this relationship, but we need to know the components that go into that relationship. If we don't know some of the components of the relationship, if we omit some aspects of x, or we put in just a linear x when we really should have x and x squared, or the logarithm of x or something, we might make a mistake, we might do something wrong. And what I mean by that is that some of these properties we're about to unfold may not hold. Secondly, it implies that the world is linear. So it's not just that we know what covariates influence y, we also know that they influence y in a certain way and that we can write that way down. Now, by linear, I mean it can be written in the form of a linear equation. So, and now I'm gonna use the non-matrix form just to illustrate this, this is also a linear specification, a linear relationship between y and x, even though it has an x squared term in it. 
The point is, we can write this specification as a linear polynomial. So that's a linear specification too. One way of thinking of this specification issue is that we need to be able to write down a polynomial like this that can accurately represent the relationship between y and x. And the final sub-assumption is that there are, in principle, fixed and constant beta values. Now, that doesn't mean that our estimates of beta, beta hat, are going to be fixed and constant. In fact, they are gonna be variable, and we're going to quantify that variability a little later in this lecture. But the actual data generating process, the true world we're going to assume for the purposes of this demonstration, has fixed betas, and if we had the right kind of information, they could be determined. They exist. So that's assumption one. Assumption two is that the expected value of the error term equals zero. Now, you can see I've made a note over here that this property and the next one I'm about to talk about are properties of U, not of U hat. In a previous lecture, I told you that it was going to be true that the average value for U hat was gonna equal zero no matter what the underlying DGP. And that's right, because that is a property of the ordinary least squares estimator, not of the world. This is a property that we are assuming about the world. So not only is it the case that the world is linear and is composed of a combination of a linear function of X and an error term U, but furthermore, we're assuming that the average value of that error is zero, which is just a way of saying that it really is error, not signal. It has no mean influence on the dependent variable. It just causes deviation from the conditional expectation of Y that's a function of X. Related assumptions are these assumptions in part three here. First, we're going to assume that the expected product of Ui and Uj, or the covariance of Ui and Uj, equals zero. 
And this is true for i and j being any two observations in the data set or in the world. So if I choose any two observations at random, the covariance between their error terms should be zero; in other words, they should have an expected product of zero. The second sub-assumption here is that the expected value of U sub i squared, the variance of Ui, should be sigma squared. This is sometimes called the homoscedasticity assumption. Homoscedasticity, there's a word you can use to impress your friends. Homoscedasticity just means that the variance of the error term is constant across observations. So combined, these two assumptions tell us something about the error term. In particular, they're trying to state in a formal way that U is an error term. It's error. It's not correlated with anything else in particular; it's not correlated here with other values of the error. And secondly, it's got a constant variance. The variance doesn't blow up and go down all over the place. It's just a sort of white noise term where the level of noise is constant. Fourth, we're gonna assume that x, the set of independent variables that we're working with, is non-stochastic, or fixed. What that means is that we're gonna treat the data set as though it was the only possible realization of the independent variable world. In other words, there was no sampling process going on in the independent variables. I'm gonna leave that as is for a second; I'll return to it in a moment. And then finally, we're gonna assume that the x matrix, which has order n by k, has rank k, which is to say that there's no perfect collinearity among the independent variables. That's one way of putting this assumption. Another way of putting it is that the columns of x, all the independent variables we specify, are linearly independent from each other. 
So we shouldn't be able to take any elements of the x matrix, combine them in linear fashion, and get another element of the x matrix. We've mentioned this before, but the most common way of doing this wrong is to fall into what's sometimes referred to as the dummy variable trap. So let me give you an example of the dummy variable trap. Suppose you've got three regions in your dataset: region one, region two, and region three. And what you wanna do is put in a dummy variable for each region. So we're gonna put in a variable x that equals one when we're in region one and zero otherwise. We're gonna put in a variable z which equals one for region two and zero otherwise, and a variable w which equals one for region three and zero otherwise. Suppose we have some dependent variable y, and we go into R and try to write down the model: y is a linear model of x, z and w, x plus z plus w plus a constant term. The constant is assumed in any R model like this one. This model is gonna crash. And the reason it's gonna crash is that x plus z plus w adds up to the constant vector of ones. So if you come down here and think about writing out this dataset, I've got x, z, w, and then the constant. There are some observations for region one and then some that aren't. Then there are some observations for region two, let's say these two, and others that aren't. And then there are some observations for region three, and that'll be the rest of these down here. And then the constant is one for every observation. Well, I can take x plus z plus w and get the constant. And that is not so good. We have violated this assumption if we do that. 
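The trap is easy to see numerically. A quick pure-Python sketch, mirroring the three-region example (the region codes for the eight observations are hypothetical):

```python
# Three region dummies plus a constant: for every observation,
# x + z + w reproduces the constant column of ones exactly,
# so the columns of X are linearly dependent (rank < k).
regions = [1, 1, 1, 2, 2, 3, 3, 3]  # hypothetical region codes, 8 observations
x = [1 if r == 1 else 0 for r in regions]      # dummy for region 1
z = [1 if r == 2 else 0 for r in regions]      # dummy for region 2
w = [1 if r == 3 else 0 for r in regions]      # dummy for region 3
const = [1] * len(regions)                     # the constant column

# The linear combination x + z + w equals the constant vector:
# the rank assumption is violated, and X'X is singular.
assert [xi + zi + wi for xi, zi, wi in zip(x, z, w)] == const
```

The usual escape from the trap is to drop one of the three dummies (or drop the constant), so the remaining columns are linearly independent.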
Not only that, but as we showed a couple of weeks ago, OLS estimates cannot even be obtained in this case, because you have more vectors in observation space than can be linearly independent in that space; it's not a proper subspace of the space, to reference some earlier terminology. Now, so we've got these five assumptions. What we're going to do in the next bit is use these assumptions as tools to demonstrate certain properties of OLS regression. We won't need all of the assumptions for every property; not all properties depend upon all the assumptions. And that's good, or interesting, because what it means is that not all of these assumptions need to be met for us to be able to say some things about ordinary least squares regression. And that means that sometimes we need to be more worried about the fragility of some of these results than others. Some of the results are less dependent on a large set of assumptions and therefore are more robust to the kind of peculiarities we might see in a live dataset. Furthermore, some properties can be sustained even under a relaxation of some of these assumptions. What I mean by this is that sometimes we can find a way to say, well, this proof relies on assumptions one, two, and four, but what if assumption four wasn't true? What if we relaxed that a little bit? We might still be able to get that property out of the model if we make a slightly different, weaker assumption. So there's some flexibility in when these things can be true. And then eventually, in the next couple of weeks, what we're gonna find out is that OLS may have different, and known, properties under violations of these assumptions. And so eventually we're gonna be able to assess the consequences of violating these assumptions by saying, well, what if this assumption isn't true? What can we expect to see from our model? 
And in some lucky events, we may even be able to fix the problems by doing something that patches the data up and makes the assumption more true. That's probably most prominently true for assumptions number two and three, problems with the error term. There are lots of cases where there's some undesirable property of the error term in a data set, or where we at least suspect that it's likely, so these assumptions aren't true and we can't appeal to them. But we'll know what's going to happen when they turn out not to be true, and maybe we can even patch things up so that the thing that happens is not all that worrisome for inference. But we're getting a little ahead of ourselves. So let's first talk about some things we can prove in the happy event that all these things are true. Okay, we're gonna start with perhaps the best known property of OLS, at least best known from the perspective of those who have taken graduate statistics and econometrics courses: that the beta hat produced out of our famous formula, just to reiterate that formula, X transpose X inverse X transpose Y equals beta hat, is an unbiased estimate of beta. That is to say, the expected value of beta hat equals beta. This is kind of the baseline property that many students of OLS know and are able to prove, and I'd like you to be able to remember and recapitulate this proof when you need to do so, on an exam or something. And it's a very simple proof, as long as we're willing to rely on the assumptions we made on the previous slide. So let's talk about what the proof of this theorem looks like. I'm gonna start with beta hat equals X transpose X inverse X transpose Y. Okay, now let's calculate the expectation of beta hat. The expectation of beta hat is just its average. And expectations obey certain rules, which will become apparent as we start to use them here. 
The first rule they obey is that if any variable is non-random, then its expectation is equal to itself, and the expectation of the product of it and any random variable is equal to that fixed variable times the expectation of the random variable. So let me just write down what I mean by this. If I'm gonna take the average, or expectation, of aX, where a is fixed and X is random, I can immediately simplify this to a times E of X: E[aX] = a E[X]. And this actually derives from the definition of an expectation, which is similar to the definition of an average. For a continuous X, to take an expectation we take the integral from negative infinity to infinity of a x f(x) dx, where f(x) is the probability density function of X. So in other words, this is a probability-weighted average of aX. Now, you'll notice that the only thing being integrated is x and f(x); a is floating around on the outside. And it's a property of integrals that this equals a times the integral of x f(x) dx, which is a times E[X]. So that's a little bit of math background on working with expectations, to give you a sense of what's going on here. This expectation operator is a lot easier to write than writing out this integral for calculating expectations of continuous variables all the time. So we're gonna make frequent use of E whenever we need to take expectations, that is to say, weighted averages. 
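The rule E[aX] = aE[X] has an exact sample analogue, because the sample mean is a linear operator too. A quick pure-Python check, with arbitrary made-up numbers:

```python
import random

random.seed(5)
a = 3.0                                            # a fixed (non-random) constant
xs = [random.gauss(0, 1) for _ in range(1000)]     # draws of a random variable X

mean = lambda v: sum(v) / len(v)

# The sample mean of a*X equals a times the sample mean of X,
# up to floating-point rounding: the finite-sample analogue of E[aX] = aE[X].
assert abs(mean([a * x for x in xs]) - a * mean(xs)) < 1e-9
```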
So I'm actually gonna take this stuff and move it down here into a new section called properties of expectations, just so you have it for your own reference if you need it. All right, now getting back to the proof at hand. This quantity here, X transpose X inverse X transpose, is all fixed, and only Y is random, and I can invoke assumption number four to say so. So this is a move I'm making as a consequence of A4, assumption four (I'm not gonna write out "assumption" every time). That gets me to X transpose X inverse X transpose times the expectation of Y. Now, the next thing I'm gonna do is substitute in what Y is. From assumption one, going back up to my assumptions here, I know that Y equals X beta plus U. So I can make use of that assumption in saying that this equals X transpose X inverse X transpose times the expectation of X beta plus U. That is true as a consequence of assumption one. So far, so good. Okay, now I've got two quantities that I'm adding together inside the expectation. It turns out, and I can come down here to my properties of expectations, that E of X plus Y, where X and Y are both random, is equal to E of X plus E of Y. And E of a plus X, where a is fixed and X is random, equals a plus E of X. In my case, the relevant property is gonna be that second one, and I know that because both parts of X beta are fixed: X is fixed and beta is fixed. 
Beta is fixed as a consequence of assumption one: we have constant beta values that are the true values. Maybe we don't know them, but they're there. U, on the other hand, is gonna be considered a random variable, as a feature of the fact that it's an error term and therefore it's noisy. It is noise; it's designed to capture noise. So by the properties of expectation, I'm gonna be able to write X transpose X inverse X transpose X beta plus X transpose X inverse X transpose times the expectation of U. All I've done here is taken E of X beta plus U equals X beta plus E of U, and then taken this matrix premultiplication and distributed it among the terms, as is consistent with matrix algebra. So this is what I get. And this is actually pretty cool, because I'm pretty close to having a result now. The first thing I can do is say that X transpose X inverse times X transpose X equals the identity. So I'm gonna be able to just kill that right there and just get beta. The next thing I'm gonna be able to do is say, well, the expectation of U, according to assumption two, is equal to zero. The expected value of the error term is equal to zero; it's an error term. So by making use of assumption two, I can now say this whole second term is zero, and we're left with beta. The expected value of beta hat equals beta. Proof complete. So there you go. By making use of a couple of assumptions, we were able to demonstrate that OLS estimates are expected to be an accurate representation of the true beta. And remember that expectation, as we discussed here, amounts to an average, or some kind of weighted average. 
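The proof just narrated, collected in one place (A1, A2, A4 are the numbered assumptions from above):

```latex
\begin{aligned}
\hat\beta &= (X'X)^{-1}X'y \\
E[\hat\beta] &= (X'X)^{-1}X'\,E[y] && \text{(A4: $X$ fixed)} \\
  &= (X'X)^{-1}X'\,E[X\beta + u] && \text{(A1: $y = X\beta + u$)} \\
  &= (X'X)^{-1}X'X\beta + (X'X)^{-1}X'\,E[u] && \text{($X\beta$ fixed, $u$ random)} \\
  &= \beta + (X'X)^{-1}X'\cdot 0 && \text{(A2: $E[u] = 0$)} \\
  &= \beta
\end{aligned}
```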
So what this is telling you is that on average, in repeated samples, if you're thinking about this in a frequentist framework, we would expect that running OLS on each one of those new samples of data would, on average, give us a beta hat equal to beta. That doesn't mean that any particular beta hat we get is equal to the true beta. Not true at all. What we say is that this method is one that, if we did it over and over again, on average would give the right answer. So you can think of this as a little bit like shooting a gun at a target. And now I'm going to inflict a terrible drawing upon you. Suppose that I've got a target, right? So I'm shooting at a target with concentric circles, as concentric as I can get them. Here we go. And right in the middle of the target is the true beta. That's what we're shooting at, that true beta. Now, any particular estimate of beta, like this one right here, might not be the right beta. But if we were to take a new sample of data out of the same data generating process and estimate a new beta hat, we might get something else. And we might get something else if we did that again. And if we did it again, and again, over and over, what we should see is that although there's variation in exactly how accurate each estimate is, on average our beta hat estimates that come out of each one of those samples from the data generating process are hitting the target. Now, a question that immediately follows from this is: how spread out will the distribution of shots at that target be? It's much better to have an estimator with a very tight distribution than a spread out, crazy one. That's true, and in fact, in a couple of slides we're going to get to a point where we're talking about that. 
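The target metaphor can be simulated directly. This is a pure-Python sketch with made-up true betas, where each replication is one "shot": a fresh draw of errors on a fixed design, refit by OLS.

```python
import random

random.seed(1)
beta0_true, beta1_true = 2.0, 0.5     # hypothetical "true" betas (the bullseye)
xs = [i / 10 for i in range(50)]      # fixed design: same x's every replication (A4)

def ols_slope(ys):
    """Closed-form simple-regression OLS slope: cov(x, y) / var(x)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    return sxy / sum((x - xbar) ** 2 for x in xs)

# Each replication is one shot at the target: new errors, new beta hat.
shots = []
for _ in range(2000):
    ys = [beta0_true + beta1_true * x + random.gauss(0, 1) for x in xs]
    shots.append(ols_slope(ys))

avg = sum(shots) / len(shots)
# Individual shots scatter, but their average sits on the bullseye.
assert abs(avg - beta1_true) < 0.02
```

Any single shot can miss noticeably; unbiasedness is a claim about the average of the shots, not about any one of them.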
One more thing I wanted to note about this proof before we move on. I wrote next to the lines of the proof each assumption that we used as part of this process. We used assumptions 4, 1, and 2. So in order to make this proof work, we had to assume, first, that the model was linear and that the betas were constant. We had to assume that the expected value of the error term is equal to zero. And we had to assume that x was non-stochastic, or fixed. We did not have to assume anything about the rank of x, although we know that OLS won't even run if we don't have that. And most significantly, we didn't have to assume those rather restrictive conditions on errors not being correlated with each other and having constant variance. So one thing this tells you, among others, is that this property holds even when U is heteroscedastic, which is to say, does not have constant variance. And the reason I'm making this point is that there are lots of people who have gone through an OLS course or some other econometrics course, who memorize these assumptions and are then always on the lookout: hey, I need to make sure that my five assumptions hold, or else... or else something. And what they usually conclude is: or else my model is completely useless, right? My model is bad. Well, it's gonna have some challenges, but exactly how the model is deficient is gonna depend on which assumptions are wrong and how those assumptions link through to the properties of OLS. So in the case, for example, where we have non-constant error variance, which is to say heteroscedasticity, which we're gonna talk about later, we would still expect our OLS estimates of beta hat to be unbiased. 
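That claim, unbiasedness surviving heteroscedasticity, can also be checked by simulation (again pure Python with made-up numbers). Here the error standard deviation grows with x, violating the constant-variance assumption, yet the average slope estimate still centers on the truth:

```python
import random

random.seed(2)
beta0_true, beta1_true = 2.0, 0.5        # hypothetical true coefficients
xs = [0.1 + i / 10 for i in range(50)]   # fixed design

def ols_slope(ys):
    """Closed-form simple-regression OLS slope: cov(x, y) / var(x)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    return sxy / sum((x - xbar) ** 2 for x in xs)

shots = []
for _ in range(2000):
    # Heteroscedastic errors: sd = 0.5 * x, so the variance is NOT constant
    # across observations -- but E[u] = 0 still holds for every observation.
    ys = [beta0_true + beta1_true * x + random.gauss(0, 0.5 * x) for x in xs]
    shots.append(ols_slope(ys))

avg = sum(shots) / len(shots)
assert abs(avg - beta1_true) < 0.03      # still unbiased, within simulation error
```

What heteroscedasticity does damage is the spread of the shots and the usual standard-error formulas, not the center of the distribution, which is exactly the distinction being made above.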
What I'm telling you is that it's very important to understand the peculiarities, or the specific nature, of how assumptions link through to results, because that's gonna enable you to diagnose regression specification problems and then make sense of how bad the consequences of those problems are gonna be. You don't wanna be in a position of underestimating or overestimating the deficiencies in your results, and you wanna have an understanding of how you can use those results responsibly; this kind of demonstration, hopefully, will enable you to make that assessment. There are gonna be some happy times when we don't need all of the assumptions to prove some kind of result. You just saw that we didn't need all five assumptions to prove a result. And in fact, we don't even need all three of the assumptions we used to prove the result that beta hat is an unbiased estimate of beta. Let me give you an example. Suppose that we make a slightly different assumption. I'm gonna start right here. All of the proof that I did up to this point is gonna be the same as last time, up until we get to here: X transpose X inverse X transpose times the quantity, the expected value of X beta plus U. Instead of writing what I wrote last time, using the assumption that the expectation of the error term is zero, I'm going to write X transpose X inverse X transpose X beta plus the expected value of the quantity X transpose X inverse X transpose U. So I didn't invoke the assumption that the expected value of U is zero, and I'm not gonna quickly go to the point where I can say all these X's are fixed and the expectation of U is zero. I am, in other words, going to suppose that X is not fixed, that it is stochastic. 
What that means is that I can't make the move I made last time of just writing zero, because I'm not able to move the expectation in past all these X's onto the U. I've gotta deal with the fact that these two things are now inextricably bound together and random, so the expectation might be something crazy. Well, what I'm gonna do, and you can sort of see I've written it already down here, instead of assuming that the X's are non-stochastic, is say: no, the X's are random. And maybe that's a more realistic view of the world. It's not as though the independent variables just exist and don't come out of some causal process of their own, a process that has some kind of error term associated with it. If we re-ran history, we would get different values of the independent variables, because the little peculiar noisy parts of history would be a little different, and those would create slightly different worlds, different X's. So the X's are random; that's more realistic. Instead, what I'm gonna assume is that X is not correlated with U. This expression is a way of saying X is not correlated with U. Now, how do I know that this crazy expected value expression that I wrote down is equivalent to saying X is not correlated with U? Well, remember our formula for beta hat: beta hat equals X transpose X inverse X transpose Y. Anything you put in for Y right here is equivalent to running a regression of it on X, okay? Well, X transpose X inverse X transpose U is running a regression of U on X. So that's one way of looking at it. What we're doing is saying, okay, there's this alpha here, and X transpose X inverse X transpose U is alpha hat, and what we expect is that alpha hat is always zero. In other words, X adds no information that would help predict U. U is orthogonal to X, to call back to a previous lecture; there's no projection of U on X. 
Another way of thinking about this is to think back to the point in the lectures where we were not considering OLS to be a reflection of the true data generating process, but merely an approximation to an unknown conditional mean, right? Remember, at the beginning of this lecture we said one of the things we could say about OLS is that the expectation of Y given X is this unknown DGP, and OLS provides the best linear approximation to that quantity no matter what process generates it. Well, what we're saying up here is kind of like saying: there's this expectation of U given X, and we're gonna assume that there's no relationship between those two things. The expectation of U given X is always zero. We could actually combine the assumption we made here with the assumption that the expectation of U equals zero to generate this assumption here; these are somewhat equivalent. Okay, so that's what the assumption means: there's no correlation between X and U. What we can do then is say, all right, let's use this assumption to finish the proof. Well, it's basically one step, because we're just gonna say that this whole expectation equals zero, and of course this times this equals the identity, just like we did last time. So we can go straight to: the expected value of beta hat equals beta, end of proof. So what we've shown here is that there can sometimes be more than one pathway to the same answer, and those pathways can differ a bit in terms of their realism, or the willingness we have to accept the assumptions. Most encouragingly, if there are multiple pathways to an answer, we might be able to invoke the pathway that's the best approximation to whatever particular data generating process we believe we're facing in a certain data set. Now, I've been talking a lot about this unbiasedness property, and it's generally considered one of the more important properties of OLS. 
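The weaker, stochastic-X version of the argument can also be simulated (pure Python, made-up numbers). X is re-drawn in every replication, but the errors are generated with no reference to X, so E[u given X] = 0 by construction, and the average estimate still hits beta:

```python
import random

random.seed(3)
beta0_true, beta1_true = 2.0, 0.5        # hypothetical true coefficients

def ols_slope(xs, ys):
    """Closed-form simple-regression OLS slope: cov(x, y) / var(x)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    return sxy / sum((x - xbar) ** 2 for x in xs)

shots = []
for _ in range(2000):
    # Stochastic regressors: a fresh X in every replication...
    xs = [random.uniform(0, 5) for _ in range(50)]
    # ...but u is drawn independently of X, so X and u are uncorrelated.
    ys = [beta0_true + beta1_true * x + random.gauss(0, 1) for x in xs]
    shots.append(ols_slope(xs, ys))

avg = sum(shots) / len(shots)
assert abs(avg - beta1_true) < 0.02
```

The fixed-X assumption was sufficient but not necessary; the weaker no-correlation assumption is doing all the work here.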
But what I wanna do right now is take a little off-ramp and tell you about a model that is biased but still useful, and often used. It's in fact gonna lead us to think about definitions of usefulness for an OLS model other than unbiasedness. So let's consider the autoregressive distributed lag model. Now, what's an autoregressive distributed lag model? We're going to suppose that we have a data set with lots of observations. Actually, I need to specify: these aren't just observations, these are units, capital N of them. Something like countries, people, units of study. And then we've also got T, a bunch of different time observations. So we have some kind of model where maybe we have one unit over multiple times, that would be a time series kind of data set, or maybe we have many units over many times, and that's what we would call a panel data set, or a time series cross-sectional data set. An autoregressive distributed lag model could, in principle, be estimated on either of those kinds of data sets. And what I'm going to do is show you that this kind of model is guaranteed to produce biased estimates of beta. The way I'm going to show you this is to show you what we expect beta hat to be when we regress yt on yt minus one. So what I'm going to do is look at beta one hat, and beta one hat comes out of a regression of M sub i yt on M sub i yt minus one. Now, what does this mean? This, as you may recall from last week, is the residual matrix from a regression on the constant term, or demeaning: not demeaning in the denigrating sense, but demeaning in the extracting-the-arithmetic-mean sense. So that's what M sub i is, and what we're doing is just saying, let's get the constant beta zero out of there. Let's just focus on beta one, the relationship between yt and yt minus one. So I'm going to run this regression. What would that regression look like? 
Well, beta one hat would be equal to, normally it's x transpose x inverse x transpose y, but the role of x will be played by M sub i yt minus one. So I've got x transpose x, okay, inverse, x transpose y, and this is y right here, M sub i yt, okay. Now, if you remember, when we did some proofs involving M and P, the residual and projection matrices, we flipped these things around in such a way that we used the idempotency of M sub i, or any M really, to simplify matters. And just to show you briefly how this works here in case you forgot, I can use the transpose rule to rewrite this as yt minus one transpose M sub i transpose M sub i yt minus one, quantity inverse, times yt minus one transpose M sub i transpose M sub i yt. Now, one of the properties that I taught you about M and P matrices is that they are symmetric. So M equals M transpose. That means I can write yt minus one transpose M sub i M sub i yt minus one, inverse, yt minus one transpose M sub i M sub i yt. And now I can use the idempotency of M sub i to write yt minus one transpose M sub i yt minus one, inverse, yt minus one transpose M sub i yt, okay. Now, let me move this down a little bit. What I'm gonna do is substitute this yt right here for its known value from the DGP. So I'm gonna write yt minus one transpose M sub i yt minus one, inverse, yt minus one transpose M sub i, times the quantity yt minus one beta one plus u that comes from here. So this is now going right here like that. And now I can multiply these quantities out, and I think you can see where this is going. I hope you can see where this is going. Yt minus one transpose M sub i yt minus one, inverse, times yt minus one transpose M sub i yt minus one, beta one, plus, now I'm gonna go to a second line, things are getting crazy, yt minus one transpose M sub i yt minus one, inverse, yt minus one transpose M sub i u. Now, this first piece is the inverse of this, and hence it all dies and we just have beta one. What about this second piece? It looks like a regression of u on the demeaned yt minus one. The problem is we're stuck.
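If you want to check the symmetry and idempotency claims about M sub i numerically, here's a small sketch in Python with numpy (the lecture's code is in R; the data vector here is arbitrary). It builds the residual-maker for a regression on the constant and confirms that it is symmetric, idempotent, and simply demeans:

```python
import numpy as np

n = 6
ones = np.ones((n, 1))
# M_i = I - 1 (1'1)^{-1} 1' : residual-maker for a regression on the constant
M = np.eye(n) - ones @ np.linalg.inv(ones.T @ ones) @ ones.T

y = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0])   # arbitrary data

print(np.allclose(M, M.T))               # symmetric: M = M'
print(np.allclose(M @ M, M))             # idempotent: MM = M
print(np.allclose(M @ y, y - y.mean()))  # M y just demeans y
```

Those three properties are exactly what licenses collapsing M'M down to a single M in the derivation above.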
This quantity I cannot actually get out of the proof. And there's a reason why we're stuck here. So I'm gonna come back in here and specify that this is actually u at time t. And there are two reasons that I can't get rid of this boxed term here, because that's what I'd like to do. I'd like to just kill it and get beta one. The problem is, firstly, yt minus one is stochastic. So I can't do that trick where I take expectations and then pass them through all of the non-stochastic components of this expression and kill everything but u. Can't do that, because yt minus one is stochastic. So if I were to come up here and take the expectation of beta one hat, what I'd get is beta one, that's just gonna be the expected value of beta one, plus the expected value of yt minus one transpose M sub i yt minus one, inverse, yt minus one transpose M sub i ut. And I'm not gonna be able to say the expectation of this whole thing equals this first part times the expectation of ut, because these terms all have expected values of their own. They're stochastic, so I can't do it. Number two, and this is the real tricky part, the really bad thing, is that clearly, and it's always dangerous when a math person says clearly or obviously, but clearly yt minus one is correlated with ut minus one. What do I mean by that? Well, it's easier to see if I rewrite some of these terms a little differently. Now I'm gonna have to move all this junk down. There we go. So I can write this expression this way: yt equals beta zero plus beta one times yt minus one plus ut. That's the true DGP. But remember that yt minus one is itself equal to beta zero plus beta one times yt minus two plus ut minus one. So the realized yt minus one carries last period's error, ut minus one, inside it. And what this means is, when the regressor is a past value of y, how do I put this?
Okay, I've thought of a good way to explain this. So remember that yt minus one is composed of the systematic components of yt minus one and the random components of yt minus one. So even if we have a perfect model of yt minus one, it's gonna be the case that the realized value of yt minus one that goes into the next period's equation is composed of signal, which is the x beta component, and noise. So for example, sometimes your economy just has a really good year because of random influences, net of all the fundamentals of the economy, and that boom in growth in that year will influence what happens next year systematically, even though the origination of the boom was itself random. So the random component of last year's dependent variable y is gonna influence this year's y. And consequently, that noise component is going to be correlated with this year's noise component through this equation. All right, actually hold on, let me take that back. This year's observed independent variable, yt minus one, is gonna be correlated with last year's error term, ut minus one. And if you go back to our assumptions, when x is stochastic we need to assume, go up here, one more, that there's no relationship, no correlation, between x and u. Well, now we know by construction there is. The value of yt minus one is gonna be a function, in part, of last year's noise term, ut minus one. So what that means is that if I'm using a dependent variable that contains noise as an independent variable later, that noise is gonna pass through to the independent variable. And so now we have a situation where noise is correlated with the regressor. So for both of these reasons, in colloquial terms, we are hosed. We can't rely on the proof techniques that we already used. And in fact, if you don't buy this, I can show you in R just by simulating an autoregressive distributed lag model.
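The claim that the lagged dependent variable is correlated with the lagged error, but not with the current error, can itself be checked by simulation. Here's a rough sketch in Python with numpy rather than the lecture's R, using arbitrary AR(1) coefficients of my own and one long series so the sample correlations are precise:

```python
import numpy as np

rng = np.random.default_rng(1)
b0, b1, T = 0.5, 0.8, 200000    # arbitrary AR(1) coefficients, long series

u = rng.normal(size=T)
y = np.empty(T)
y[0] = b0 / (1 - b1) + u[0]     # start near the long-run mean
for t in range(1, T):
    y[t] = b0 + b1 * y[t - 1] + u[t]

ylag = y[:-1]                   # the regressor: y lagged one period
print(np.corrcoef(ylag, u[:-1])[0, 1])  # y_{t-1} vs u_{t-1}: clearly positive
print(np.corrcoef(ylag, u[1:])[0, 1])   # y_{t-1} vs u_t: essentially zero
```

The first correlation is large because u at time t minus one is literally a component of y at time t minus one; the second is near zero because the current shock hasn't entered the regressor yet. That asymmetry is what breaks unbiasedness but will later rescue consistency.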
I can just show you that it don't work out good. Bad things happen from a bias perspective if you estimate these models. So let's go ahead and do that now. Let's do it. So let me come down here to RStudio. And here's the lecture file for this week. You can see that first what I'm doing is setting a random seed. You probably talked about this in your previous class. This just ensures that the random variables that I'm drawing out of my R program are gonna be exactly the same as the random variables that you're drawing when you run this code in your R program, even though we're running them at different times. The reason that's possible is that the random number generating algorithms in R are not truly random, at least not the commonly used ones. There are ways of making them truly random, but the typical ones are just deterministic functions which happen to look random. Anyway, now what I'm gonna do is create a matrix to store estimated values of beta hat out of the regression I'm gonna run, beta hat store. And I'm just gonna fill that with NAs for now. I've got two columns because I'm estimating two betas. And I've got 5,000 rows because I'm estimating 5,000 runs of this regression. And I'm gonna name the columns of this matrix, whoops, I'm gonna move that around there, I'm gonna name the columns of this matrix constant and lag. So if I were to do head of beta hat store, it would indicate that the first column was my estimates of the constant term and the second column was my estimates of the lag term. Now the next line is setting up a progress bar: pb gets txtProgressBar, min zero, max 5,000, style three. What does this mean? Well, what this means is that when I run through this loop, R is gonna display a text progress bar that shows how far along this loop has gotten.
It's gonna start at zero and go to 5,000, which matches the number of replications I'm running of this particular model. And style three just says it's gonna put a percentage indicator out here saying what percentage of the run is done. Now, this is the for loop that's gonna actually run my real simulation here. And the first thing I'm gonna do is call setTxtProgressBar with the value of j, which is to say that whatever iteration of this loop I'm on, I tell the text progress bar function that's where I am. And you'll see how that creates a nice visual display here in a few minutes. What I wanna do then is create an autoregressive distributed lag model out of some fake data process. And so what I've done is I've said that beta zero is 0.435 and beta one is 0.859. So y at time i plus one equals 0.435 plus 0.859 times y in the previous period i, plus a normally distributed random error term. And that goes from i equals one to 30. So when i equals one, y two equals blah blah, y one. That's why I had to create y one up here, because I had to start at a certain time and just pull the starting, initialized value of y out of thin air. That's where I pulled it from. I just set it myself. So I'm gonna create this dataset, and when I create this dataset, I'm gonna get a string of ys out of an autoregressive distributed lag model. Let me just feed it a value of j and show you what goes on here. So if I go ahead and run this and then type y, what I've got is a dataset of length 31. And that dataset of length 31 is all the different values of the dependent variable through the 31 periods of the model. Now what I wanna do is run a regression of yt against yt minus one. And it turns out that even though I have 31 observations, I only have 30 usable observations of the independent variable. Why is that the case? Well, because yt is a function of yt minus one plus error.
And for the lag term yt minus one, I don't have an observation of the lag for the first dependent variable in my dataset. So I have to chop off that first observation, because I don't have an x that corresponds to it. So I create two variables: y standard equals y 2 to 31, the last 30 observations. And then my x variable is y 1 to 30, which is the first 30 observations of the dependent variable. I put an x matrix together, which is a combination of the constant and the y lag. And then I run a regression where I say, okay, beta hat equals x transpose x inverse times x transpose y, where x is just the one and the lag. And then I store that result in the row corresponding to the particular value of j this time around. So if I just run this little bit right here, b hat, there are my betas from my toy regression. And if I go to head of beta hat store, you can see they've been chunked into my storage matrix for the first run. The constant ended up going right here, there's where my constant estimate was, and there's my estimate for the lagged y. And now what I'm gonna do is run this 5,000 times. It'll become apparent pretty soon why I'm doing this. Let me just run this 5,000 times. You can see there's the text progress bar running merrily across the way to show how much of the result has been finished. So now I've got this beta hat store matrix with 5,000 replications of generating a dataset out of the DGP, running a model, and seeing how close the model gets to the true DGP. And each dataset is a little different because of this right here, this normally distributed error, which is never quite the same each time. And it represents randomly repeating the observation of the world out of the same DGP over and over and over again. If this model is unbiased, it should be the case that that repeated estimation process is, on average, an accurate reflection of the true DGP.
So the true DGP is 0.435 plus 0.859 times yt minus one. Alas, it doesn't happen. Here's the mean of our estimates minus the true value, this one for the constant and this one for the beta coefficient on yt minus one. Those numbers are not zero. They're not even close to zero. In fact, I've made a couple of plots for you here. So let me just throw those plots up there. Actually, I'm gonna throw this one up first. So this is the bias in the estimated constant for the autoregressive distributed lag model. Let me open this up a little bit. On the x-axis, what I have here is the estimate that I got out of each of the 5,000 repetitions of this process. And this is the estimated PDF, or probability density function, for that group of estimates. The true value is supposed to be 0.435. And that's indicated by this thin dotted line here. The thick dotted line is the actual mean estimate from my 5,000 replications. And what you can see is that my estimates for the constant term are too big on average. They skew right. They skew too big. Similarly, for the beta term, the actual value should be 0.859 for the coefficient on yt minus one. But our estimation procedure is skewing left in terms of its estimates for this parameter. And the mean of that estimate is too low by a considerable margin. It's biased downward. So we're getting constants on average that are too large and coefficients that are too small. That's a problem in terms of our substantive understanding of how things would work if we used this model in the real world. We would misstate the true relationship that exists. What I've just done is an example of a Monte Carlo study. I wanted to know how a particular estimator performed in a certain environment. And so in order to discover that, I did an experiment, a simulated experiment, where I said: here, I know what the world is for sure. I'm gonna see how well this estimator recovers the world.
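For readers following along in Python instead of R, here's a rough translation of the Monte Carlo just described, with numpy standing in for the R code. The structure (5,000 replications, 30 time periods, betas of 0.435 and 0.859) mirrors the lecture, though details like the initialization are my own guesses:

```python
import numpy as np

rng = np.random.default_rng(2)
b0, b1 = 0.435, 0.859           # true DGP values from the lecture
T, reps = 30, 5000

bhat = np.empty((reps, 2))
for r in range(reps):
    y = np.empty(T + 1)
    # my own choice: initialize from the stationary distribution
    y[0] = b0 / (1 - b1) + rng.normal() / np.sqrt(1 - b1**2)
    for t in range(T):
        y[t + 1] = b0 + b1 * y[t] + rng.normal()
    X = np.column_stack([np.ones(T), y[:-1]])   # constant and lagged y
    bhat[r] = np.linalg.solve(X.T @ X, X.T @ y[1:])

bias = bhat.mean(axis=0) - np.array([b0, b1])
print(bias)   # constant biased upward, lag coefficient biased downward
```

The directions of the bias match the plots in the lecture: the average constant is too large and the average lag coefficient is too small, even though the estimated model is exactly the true DGP.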
And I can do that in this special case because on this little blackboard, I control the world. And it's pretty troubling when, even when you control the world and all the parameters are known, an accurately specified model does not recover the true coefficients. That's troubling. Even when I'm right, I'm wrong. That's not a good sign. So you might be thinking to yourself, well, geez, ADLs sure do suck. But they get published all the time and they get used all the time, and there is a reason for that. So now I wanna talk a little bit about why anyone ever uses these. And the answer is they actually do turn out to be useful under a circumstance which is hopefully not too uncommon in political science. Incidentally, it occurs to me that you may not have seen one of the functions that I used in this R code before. Coming down here, you may not have seen the density function. The density function is a way of recovering information about a probability density out of a data set. You feed it a vector and it spits back out its best guess at the probability density out of which that data came. It operates in a way similar to the Nadaraya-Watson nonparametric estimate of the mean that we've talked about in the past, except instead of estimating y given x, it estimates the density of x itself. And its estimate of the density at x is a weighted average of the number of observations near any particular value. So for example, coming down here to this ADL model density plot, when the estimate of the coefficient on yt minus one is 0.4, the estimated density is about 0.25, meaning only a small share of the estimates, a few percent, land in that neighborhood.
So all it's doing is just saying, well, take the number of observations proximate to this particular value of x and derive a density estimate out of the kernel-weighted number of observations that are near that value. The kernel weighting can happen via a variety of kernel options. I believe the default for density is the Gaussian kernel, with alternatives like the Epanechnikov kernel available, but generally speaking they're bell-shaped weight functions, similar looking to the normal distribution. So getting back here, I've shown that ADL models are biased, but an ADL model might be consistent. In fact, it is consistent, which is useful, but before you can know that, it requires me to define what consistency means. What do I mean when I say consistency? What I mean is that the probability limit, as n approaches infinity, of beta hat equals beta. So what am I talking about here? Well, limits are different from expectations in the sense that expectations are true regardless of sample size. So unbiasedness is a property of the expectation of beta hat alone, regardless of the sample size. Now, it might be the case that in very small samples, our estimate of beta hat is going to be quite variable, and in fact it will be very variable. There'll be a lot of variation in it, but if you did it a large number of times, a thousand times, on average you'd hit the right target. It's just that your pattern would be very spread out. You'd be missing a lot, but on average you would be hitting the center, even if your sample were small. Consistency is a property of large samples, in fact, technically of infinitely sized samples. So consistency is sort of like unbiasedness except it requires us to have a huge sample in order to be able to invoke that property.
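If it helps to see the mechanics of what density is doing, here's a bare-bones kernel density estimator sketched in Python with a Gaussian kernel. Everything here (numpy, the function name, the hand-picked bandwidth) is my own illustration; real implementations like R's density choose the bandwidth automatically:

```python
import numpy as np

def kde(grid, data, h):
    """Kernel density estimate: at each grid point, average the Gaussian
    kernel weights of all observations, scaled by the bandwidth h."""
    z = (grid[:, None] - data[None, :]) / h
    weights = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return weights.mean(axis=1) / h

rng = np.random.default_rng(3)
data = rng.normal(size=5000)            # sample from a standard normal
grid = np.linspace(-4, 4, 81)
est = kde(grid, data, h=0.25)           # bandwidth chosen by hand here

true = np.exp(-0.5 * grid**2) / np.sqrt(2 * np.pi)
print(np.max(np.abs(est - true)))       # estimate tracks the true density
```

Each grid point's estimate is literally a kernel-weighted count of how many observations sit nearby, which is the verbal description above made precise.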
In your homework, you're gonna demonstrate the consistency of autoregressive distributed lag models by repeating my Monte Carlo simulation in ever larger samples and showing that you do get convergence of the estimates of beta zero and beta one to their true values as n gets bigger and bigger and bigger. But we can actually formally prove this property as well. And we're going to do so right now. So okay, let me state some basic characteristics here and then we'll move on to the proof. Well, let me just write this first. The expectation of beta hat, generally speaking, is equal to beta plus the expectation of X transpose X inverse X transpose u. And the problem with the ADL model that we ran into is that this second term was not equal to zero in our particular case. Now, for a stochastic quantity a of Y, we will say that the probability limit as n approaches infinity is equal to a zero if the limit, as n approaches infinity, of the probability that the distance between a of Y and a zero is less than epsilon equals one, for any small epsilon. So what am I saying here? Well, what I'm saying is, let me move this down again, look at the distance between some function of the random quantity Y, that's a of Y, and the point a zero. The gap between our estimate and the true value of some quantity has to collapse toward zero, with probability one, as n goes to infinity. So in our case, the distance from our estimate of these beta coefficients to the true value needs to vanish as n gets really, really big, goes to infinity. And we're gonna write this as the following: plim, or probability limit, as n goes to infinity, of a of Y equals a zero.
And the reason we write this as a probability limit is that Y is a random quantity. Y isn't always the same. And what that means is that the distance we're interested in, which is to say this distance right here, the Euclidean norm, is also gonna be a random quantity. So we can't talk about it in terms of constants, we have to talk about it in terms of probabilities. And what we want to be true is that the probability that that distance is really small, which is to say less than epsilon, has to get close to one as n goes to infinity. Because Y is random, that distance is never truly going to be zero as long as Y retains some randomness. But for very large samples, we should be able to get narrower and narrower in our estimates, to the point where they're so close that the deviation between the true value and the estimated value gets vanishingly tiny. That's colloquially what we're saying here. And it might be simpler to write an example of this. What's the probability limit as n goes to infinity of Y bar, the average value of Y? Well, that's the probability limit as n goes to infinity of one over n times the sum from i equals one to n of Y i, and it's gonna equal mu Y, the mean of Y. Why is that true? Well, the first thing we know is that the expectation of Y bar equals the expectation of one over n times the sum from i equals one to n of Y i. One over n is a constant, so we can rewrite this as one over n times the expected value of the sum from i equals one to n of Y i. Now, this is a sum of numbers, and a law of expectations says that the expectation of a sum is the sum of the expectations. What's our expectation for Y i? Well, our expectation for Y i is the mean.
So what we have is one over n times the sum from i equals one to n of mu Y, which is to say one over n times n times mu Y, which is to say mu Y. That shows us that the average is on target in expectation. But we also need to show that the residual variability in that estimate goes to zero as the sample gets larger and larger. So you might write this as only being the first step. We also need to show that the variance of Y bar, the variance of the sample mean, converges to zero. And in case I'm throwing terms around a bit cavalierly: Y bar is the sample mean of Y for any given sample, and mu Y is the true population mean. It's not enough to show that the sample mean is in expectation equal to the population mean. If we want the sample mean to be consistent, we also must show that the variability in the sample mean converges to zero. If you put those two facts together, the variability gets smaller and smaller, and the estimate is on average on target, so the distance between the target and the particular estimate converges to zero. So: the variance of Y bar, the sample mean of Y, is equal to one over n squared times the sum from i equals one to n of sigma squared. How do we know that that's true? Well, it turns out that there's a law of variances: the variance of a random quantity X times a fixed constant a is a squared times the variance of X, where X is random and a is the constant. We're just using this fact in our proof up here, but there's an intermediate step that I've omitted. That intermediate step, if I scroll over here, is that I want the variance of one over n times the sum from i equals one to n of Y i. And I'm saying the variance of that quantity equals the constant one over n squared times the sum of each individual variance of each individual Y i. How do I know that this second part is true? In other words, how can I take this part and write it as this?
Well, the reason I can do that is because, and I'm gonna have to grab this stuff and move it down here, I can think of each observation Y i as being equal to the true mean, mu Y, plus a random error component u i. And I'm going to assume that u i is distributed independently and identically with variance sigma squared. Now, it's a property of random variables that the variance of X plus Y, where X and Y are random, is equal to the variance of X plus the variance of Y plus two times the covariance of X and Y. Well, because I've assumed that these u components are independently and identically distributed, that must mean that the covariance terms equal zero, because they're independently distributed. In other words, they don't covary with each other at all. They're not related to each other. Their covariance is zero. The regression of some components of the error on other components of the error would yield zero. So this bit right here is just saying, I can take the variance of these individual random variables Y i and add them all up, and I don't need to worry about the covariance terms, because by assumption they're equal to zero. So what this gets us, let me move this down a little bit and just rewrite it the old-fashioned way: the variance of Y bar equals one over n squared times the sum from i equals one to n of sigma squared. Or in other words, one over n squared times n sigma squared, which is sigma squared over n.
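A quick simulation can illustrate the sigma squared over n result. This Python sketch (numpy assumed; the population mean and variance are arbitrary choices of mine) computes the empirical variance of the sample mean at several sample sizes and compares it to sigma squared over n:

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma2 = 10.0, 4.0        # arbitrary population mean and variance
reps = 10000

for n in (10, 100, 1000):
    # draw `reps` samples of size n and take each sample's mean
    ybar = rng.normal(mu, np.sqrt(sigma2), size=(reps, n)).mean(axis=1)
    print(n, ybar.var(), sigma2 / n)   # empirical variance tracks sigma^2 / n
```

Both pieces of the consistency argument are visible at once: the sample means stay centered on mu, and their spread shrinks like sigma squared over n.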
Now, remember we're taking the probability limit, so we need to figure out the limit of this thing as n goes to infinity. What's the limit of one over n times a constant? No matter what that constant is, that's gonna go to one over infinity times some constant, or zero. So that's the second part of what we needed. We need to know that our estimate is on target and that the variability of that estimate goes to zero as our sample gets infinitely large. So: one, the estimate is on target; two, the variability of the estimate diminishes to nil in large samples. That's how we know the sample mean is consistent, in the sense of giving an accurate estimate in large samples of the population mean. It's important to re-emphasize that probability limits are not the same as expectations. Expectations hold in small samples or in large samples. Consistency results only hold in large samples. Furthermore, it's also the case that the expectation of some function of Y is not necessarily equal to the function of the expectation of Y, where f here is just any old function. But the plim as n goes to infinity of f of Y equals f of the plim as n goes to infinity of Y, at least for continuous functions f. So there's a substitution you can make with probability limits that you can't make with expectations. Consistency results are actually easier to obtain in some cases. Okay, so now I've filled you in on what consistency means. Now what I'm gonna do is show that ADLs, by this definition, are consistent, by making one assumption, which is that the expectation of x at time t times u at time t equals zero. So what I'm gonna do is write down that beta hat equals beta plus X transpose X inverse X transpose u. And that's where we left off in our little proof here. There it is. This is the expectation of beta hat, there's beta, and this is X transpose X inverse X transpose u. It's just that the role of X is being played by M sub i yt minus one.
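The substitution property can be illustrated numerically too. In this Python sketch (numpy assumed; f is arbitrarily chosen to be squaring, and the sample sizes are my own), the expectation of f of Y bar is far from f of the expectation in small samples, yet f of Y bar still collapses onto f of the plim as n grows:

```python
import numpy as np

rng = np.random.default_rng(5)
f = np.square                 # an arbitrary continuous function, f(y) = y^2
reps = 5000

small = rng.normal(size=(reps, 5)).mean(axis=1)      # Ybar at n = 5
big = rng.normal(size=(reps, 1000)).mean(axis=1)     # Ybar at n = 1000

# In small samples, E[f(Ybar)] is nowhere near f(E[Ybar]) = f(0) = 0 ...
print(f(small).mean())        # roughly 1/5
# ... but f(Ybar) does converge in probability to f(plim Ybar) = 0
print(f(big).mean())          # roughly 1/1000, essentially zero
```

That's the practical content of the substitution rule: it buys you nothing in small samples, but it lets you pass plims through functions, which is exactly the move the ADL consistency proof is about to make.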
So what I wanna do now is take plims of this thing. I'm gonna write it as beta hat minus beta equals X transpose X inverse X transpose u. And then I'm gonna take the plim as n goes to infinity of both sides, so the plim of beta hat minus beta equals the plim of the quantity X transpose X inverse X transpose u. Now, what I can do is multiply by a tricky form of one, and multiplying by that tricky form of one is gonna enable me to do something. So let me just rewrite this and then show you what I did: the quantity one over n times X transpose X, inverse, times one over n times X transpose u. So basically what I've done is multiplied by one over n inverse times one over n, which is n over n, or one. So that little multiplication move didn't change anything. It's just gonna let me do something tricky. And what that something is, is I'm gonna be able to dispose of infinite sums. X transpose X is a squared sum of Xs, and if I didn't have that one over n in there, as I let n go to infinity, you'd just get bigger and bigger X sums. Multiplying by one over n shrinks those back down to something finite. So now I can break this up, and I wouldn't be able to do this with expectations, into the plim of one over n X transpose X, inverse, times the plim of one over n X transpose u. I'm gonna call the plim of one over n X transpose X by the name S X transpose X, just a name for it, it's gonna be some finite quantity.
And then that's multiplied by the plim of one over n times the sum from t equals one to n of x t transpose u t. So I'm getting back into the time element here. All I've done is take this column and write it out as a sum again. Now, it is reasonable to assume that the expectation of x t times u t equals zero for every t. Why is that reasonable to assume? Well, it's reasonable precisely because, if you go all the way back to our autoregressive distributed lag model, remember that the problem, going back to the original section, was that we had a pretty much obvious correlation in the ADL model between yt minus one and ut minus one. Because, going down to here, yt minus one is by definition composed of a signal element and a bit of noise element. So we can't say that all the xs and all the us are uncorrelated for all t. That was the nature of the problem. We can, however, say that x at time t and u at time t are uncorrelated, because going back up to the model, x at time t is yt minus one, and u at time t is ut, and it's reasonable to assume that these two aren't correlated. Even though yt minus one is composed of its signal, yt minus one hat, and its noise, ut minus one, those pieces are not necessarily, and in fact typically not, correlated with this period's error, ut. So as long as we can assume that the error terms are not correlated across time, it's reasonable to assume that this x, which is a composite of last period's error and last period's signal, is not correlated with this period's error.
Now, of course, all of this is gonna fall apart if the errors are truly correlated across time; then all bets are off and this consistency proof is no longer going to work. But if we can make that assumption, not jump to that conclusion, but simply rest on that assumption, then we can say, okay, this here is equal to the following. So I'm gonna go up here and say: the plim as n goes to infinity of beta hat minus beta equals S X transpose X inverse times the plim of one over n X transpose u, which is just zero. So we get that the probability limit of the gap between our estimates and the true value of beta is zero, and everything is okay. So to summarize what we've talked about: as long as ut and ut minus one are uncorrelated, for all t from one to however long the time series is, say capital T, then an ADL, or autoregressive distributed lag, model is consistent. Now, we've spent a lot of time, naturally, talking about whether our estimates of beta are in any way related to the actual values of beta. And sometimes the answer is yes in all sample sizes. Sometimes the answer is yes only in very large samples, as we've shown. Now what we wanna talk about is, okay, that's all fine and dandy, but getting beta right on average is not the only important question. It's also important to know how variable our estimates of beta hat are. Just because something is right on average doesn't necessarily mean it's useful. If the variability in our estimates is enormous, then the fact that, if we did it 1,000 times, the cloud of 1,000 estimates would be centered on the true beta is not particularly useful, because in fact we only have the one estimate of beta hat that we get from our one data set, and we want the variability of our estimate to be low enough that that one estimate is informative about the true state of the world.
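The consistency claim here is exactly what the homework simulation should show. Here's one possible Python sketch of it (numpy assumed; the replication count and the sample sizes are arbitrary choices of mine). The bias of the lag coefficient is severe in short series and shrinks toward zero as the number of time periods grows:

```python
import numpy as np

rng = np.random.default_rng(6)
b0, b1, reps = 0.435, 0.859, 1500

def mean_slope(T):
    """Average OLS slope from regressing y_t on y_{t-1}, over many replications."""
    est = np.empty(reps)
    for r in range(reps):
        y = np.empty(T + 1)
        # start the series from its stationary distribution (my own choice)
        y[0] = b0 / (1 - b1) + rng.normal() / np.sqrt(1 - b1**2)
        for t in range(T):
            y[t + 1] = b0 + b1 * y[t] + rng.normal()
        X = np.column_stack([np.ones(T), y[:-1]])      # constant and lagged y
        est[r] = np.linalg.solve(X.T @ X, X.T @ y[1:])[1]
    return est.mean()

bias = {T: mean_slope(T) - b1 for T in (30, 120, 480)}
print(bias)   # clearly negative at T = 30, shrinking toward zero as T grows
```

The bias never hits exactly zero at any finite T, which is why the estimator stays biased, but it vanishes in the limit, which is the consistency property we just proved.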
So the first step in figuring out whether this estimate is worth anything is figuring out how variable our estimate is. And to know that, we need to know the variance of beta hat. Well, what's the variance of beta hat? The variance of any quantity is just the expected squared deviation of the quantity from its average. So in other words, if I write it out for beta hat here: I take beta hat, subtract its expectation, multiply that deviation by its own transpose, and take the expectation of that product. Oh, put the transpose in the wrong place. There we go. What is this gonna give me in a matrix sense? Well, think about beta hat as being a vector of betas here, beta one, beta two; we'll just use two of them. And there are gonna be two expected values; we'll call them mu one and mu two. Those are gonna be our mean beta hats, or expected values of beta hat. And so this subtraction quantity here is just beta one minus mu one and beta two minus mu two. And then we're gonna take that and multiply it by its transpose. So using our row-by-column multiplication here, what we get is: beta one minus mu one squared; beta one minus mu one times beta two minus mu two; beta two minus mu two times beta one minus mu one; and beta two minus mu two squared. So we've got a two by two matrix. And that makes sense, because this is a two by one matrix and this is a one by two matrix, so the result should be a two by two matrix. This is the variance-covariance matrix. This is the variance of beta one, this is the variance of beta two, and these are how much beta one covaries with beta two. So in other words, our estimates of beta one and beta two: how much do they move together?
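One way to see what that two-by-two matrix means is to actually build the cloud of estimates it describes: redraw the data many times, re-estimate beta hat each time, and take the covariance of the resulting draws. This is my own sketch in Python, not lecture code, and the particular betas and sample size are made up.

```python
import numpy as np

# Sketch (mine): the VCV of beta-hat is the covariance of the cloud of
# estimates you would get from repeated samples of the same DGP.
rng = np.random.default_rng(1)
n, reps = 200, 2000
beta = np.array([1.0, 2.0])          # two coefficients, as on the board

draws = np.empty((reps, 2))
for r in range(reps):
    x = rng.uniform(0, 10, n)
    X = np.column_stack([np.ones(n), x])
    y = X @ beta + rng.normal(0, 2, n)
    draws[r] = np.linalg.solve(X.T @ X, X.T @ y)   # OLS each replication

vcv = np.cov(draws, rowvar=False)    # 2x2: variances on the diagonal,
print(vcv)                           # covariance off the diagonal
```

The diagonal entries are the variances of the two coefficient estimates, and the off-diagonal entry (which shows up twice, symmetrically) is how much they move together across samples.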
I should actually be clear about this, because it is real important to recognize that these are the estimates of beta and not the true values of beta. So you might come in here and put a hat on all of these to remind you these are all properties of the estimates. What we wanna know is: how accurate is our estimate of beta? Actually, it's even more informative to write in, well, we know because of the unbiasedness of OLS that the expected value of beta hat is just the true value of beta. As long as we've got an unbiased estimate, that's so. So we can come in and substitute: beta one hat minus beta one, beta two hat minus beta two. That should make it even clearer that what we've got here, I'm gonna write in all those betas, beta one, beta one, beta two, beta two, beta one, beta two, is a bunch of variances and covariances of beta hat with respect to its mean, which is the true value of beta. The one thing I wanna add to that, I'm gonna come down here and move these just a bit further, is that I can rewrite this two by two matrix as: variance of beta one hat; covariance of beta one hat and beta two hat; covariance of beta two hat and beta one hat, and these two will be equal; and the variance of beta two hat. These two matrices, this one and this one, mean the same thing. And this is what we like to call the VCV, or variance-covariance matrix, of beta hat. I've done it for an example where beta hat is a vector of length two, so there are two coefficients here, but the same principle applies for a vector of length 10 or 20 or 100; all that changes is that the number of variance and covariance terms gets larger and larger, and the ultimate VCV matrix will end up being K by K, where K is the number of elements of beta hat. Now, so far I've been writing this a little bit abstractly.
In other words, I haven't done anything that you can actually estimate on data yet; I'm just talking in general about this thing called a variance. So what we need to know is what beta hat minus beta is. Well, beta hat minus beta is gonna be equal to X transpose X inverse X transpose U. How do we know that? Well, beta hat is X transpose X inverse X transpose Y. We know that Y equals X beta plus U in the real DGP, so if we do our usual kind of stuff and pre-multiply both sides of that by X transpose X inverse X transpose, we get X transpose X inverse X transpose Y equals beta plus X transpose X inverse X transpose U. This here, from there, is beta hat. So beta hat equals beta plus X transpose X inverse X transpose U, and ergo beta hat minus beta equals X transpose X inverse X transpose U. So that's how we know that beta hat minus beta equals that quantity. So now what we can do is come in here and say, okay, this is the variance term that we're trying to estimate: the expectation of beta hat minus the expected value of beta hat, times beta hat minus the expected value of beta hat, transposed. So what we're gonna do is substitute this in for this. So I'm gonna come down here and write: the variance of beta hat is equal to the expected value of beta hat minus the expected value of beta hat times its transpose, which is equal to the expected value of beta hat minus beta times beta hat minus beta transpose. And this is what we get via unbiasedness of OLS. Then I can take this bit and plug it in right here and here, so I get the expectation of X transpose X inverse X transpose U times X transpose X inverse X transpose U transposed. A real nice matrix playground.
So now we just gotta use a bunch of rules of transposes to clean everything up and make it look pretty. I've got a transpose of a product here, so I'm gonna say this is A and this is B, and the transpose of A B is B transpose A transpose. So writing that out: X transpose X inverse X transpose U times U transpose, X transpose transpose, times X transpose X inverse transpose. And I'm gonna save us all a lot of time, because we've done this before: X transpose transpose is just X, and X transpose X inverse is symmetric, so its transpose is itself. So I can rewrite this as X transpose X inverse X transpose U U transpose X times X transpose X inverse. There you go. Well, we're just about done. Now let's use some facts. Assumption three tells us two things about the error terms. First, the expected value of U i squared is equal to sigma squared for every i, which is to say the variance of each error term is the same, homoscedastic. Second, for any two different terms U i and U j, the expected value of U i U j, their covariance, is going to be zero. Both of these things, going back up here, are given by assumption three; they're true because we assumed them to be true in A3. And that means we can say, all right, what is U U transpose? Well, U U transpose is a matrix: U is n by one, U transpose is one by n, so U U transpose is n by n. Its on-diagonal elements are going to be U one squared, U two squared, and so on. Its off-diagonal elements are going to be the cross products: U one U two, U one U three, U two U three, and so on, like that.
That's going to be U U prime. And that's something we actually made reference to very far back, when we were talking about matrix algebra. I hinted that, hey, we need to know what structure a vector times its transpose has, because that's going to become really important. Well, now it's really important; there it is right there. It's important because U U transpose has that form, and now we can use these assumptions to say: every single one of the on-diagonal elements, all of these, I can replace with sigma squared, and all the off-diagonal elements I can replace with zero. So, taking expectations, I can rewrite this bit right here as this matrix, and furthermore, I can write that matrix as sigma squared times I, where I is an n by n identity matrix. This is going to make things boil down very, very fast. So going up here and rewriting this little equation right here, what I've got is X transpose X inverse X transpose sigma squared I X times X transpose X inverse. Now, sigma squared is a scalar constant, so I can move it outside, over here. That gives me sigma squared times X transpose X inverse X transpose I X X transpose X inverse. I times any matrix is just that matrix again, so now we have sigma squared X transpose X inverse X transpose X X transpose X inverse. And any matrix times its inverse is the identity, so finally I get sigma squared X transpose X inverse. Great. This is one of those formulas you might consider tattooing on your arm. This is the matrix formula that gives you the variance of your beta hat estimates: the formula for, whoops, variance of beta hat, a.k.a. the VCV matrix for your regression. So that's the formula that Stata or R or any of these programs use to give you the friendly standard errors that come out of your regressions.
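That formula, sigma squared times X transpose X inverse, can be checked by brute force: hold X fixed, redraw the errors many times, and compare the empirical covariance of the estimates to the analytic matrix. This is my own Python sketch, not lecture code; the particular X, beta, and sigma are made up for illustration.

```python
import numpy as np

# Sketch (mine): check Var(beta-hat) = sigma^2 (X'X)^{-1} by simulation,
# with X held fixed across replications as the derivation assumes.
rng = np.random.default_rng(2)
n, sigma = 100, 2.0
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])
beta = np.array([1.0, 0.5])

analytic = sigma**2 * np.linalg.inv(X.T @ X)

draws = np.empty((5000, 2))
for r in range(5000):
    y = X @ beta + rng.normal(0, sigma, n)   # fresh errors, same X
    draws[r] = np.linalg.solve(X.T @ X, X.T @ y)
empirical = np.cov(draws, rowvar=False)

print(np.round(analytic, 4))
print(np.round(empirical, 4))   # agree up to simulation noise
```

The two matrices agree up to Monte Carlo noise, which is exactly the claim: the formula is the long-run covariance of the estimator under homoscedastic, uncorrelated errors.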
It's just the diagonal elements of this matrix, square root of that; that's your standard errors. Now, I've got one little thing, not really a little thing, that I've left out. Sigma squared, you recall, is the expected value of u i squared. That's true by assumption; it's the homoscedasticity assumption. But we need a way to estimate sigma squared, the expected value of u i squared, and this is actually trickier than it looks. The reason it's tricky goes back to something we learned about OLS a while ago, which is the following. You may remember that OLS can be thought of as a geometric construction. If we've got some variable y here and some variable x, what OLS regression does is it picks the beta that stretches x out to the point where u hat is minimized. And as we learned, u hat is minimized when the angle between those two things is right: 90 degrees. The problem here is, unless beta hat equals beta exactly, the length of u is necessarily going to be bigger than the length of u hat, because u hat is chosen as a minimum. In other words, it's unlikely that the true beta is exactly equal to the beta hat that we picked, because there's noise in our estimate, and OLS is minimizing the estimated error. That means this estimated error vector here is gonna be as small as it can be. The true error vector may not be at a right angle to x, who knows, and its length is gonna be at best the same as u hat's and often longer. A way of summarizing this qualitatively is that OLS underestimates u: u hat is an underestimate of u because OLS minimizes u hat squared. So one over n times u hat transpose u hat, the variance of u hat, underestimates the variance of u.
It can be shown, and I'm not gonna bother doing this, but I just wanna let you know that it can be proved, that the expected value of one over n times u hat transpose u hat equals n minus k over n times sigma squared, or equivalently that the expected value of u hat transpose u hat equals n minus k times sigma squared, where k is the rank of X, the number of coefficients. There are multiple proofs of this statement. I am not gonna show them to you right now, but you can find a proof, for example, on pages 107 to 110 of Davidson and MacKinnon, if you wanted to look it up; it's there in your book. You will also "prove" this statement in your homework using simulations. Simulation is not a formal proof, but it'll at least give you some evidence that it is in fact true. So I'm just gonna ask you to accept that I've read the proof and that it's true, and go to the next step, which is that our estimate must be sigma hat squared equals one over n minus k times u hat transpose u hat. All we're doing here is saying: one over n times u hat transpose u hat is kind of our naive estimate of the variability in the error term, and dividing by n minus k instead of n is our way of upweighting that estimate, inflating it a little bit. So normally the variance of some quantity would be one over n times the sum of squares; here we calculate one over n minus k times u hat transpose u hat, and this corrected quantity is larger, because there's a smaller number in the denominator, so the total quantity is bigger. So all that rounds out to the following finding: the variance-covariance matrix of our estimate beta hat equals X transpose X inverse times the quantity one over n minus k times u hat transpose u hat.
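The degrees-of-freedom point is easy to see by simulation, much as the homework asks. Here is my own sketch in Python (not the homework code): dividing the residual sum of squares by n understates sigma squared by the factor (n minus k) over n, while dividing by n minus k is right on average.

```python
import numpy as np

# Sketch (mine): u-hat'u-hat / n understates sigma^2, and dividing by
# n - k corrects it on average, matching E[u-hat'u-hat] = (n - k) sigma^2.
rng = np.random.default_rng(3)
n, k, sigma = 30, 5, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.ones(k)

naive, corrected = [], []
for _ in range(4000):
    y = X @ beta + rng.normal(0, sigma, n)
    bhat = np.linalg.solve(X.T @ X, X.T @ y)
    uhat = y - X @ bhat
    ssr = uhat @ uhat
    naive.append(ssr / n)            # biased low by (n - k)/n on average
    corrected.append(ssr / (n - k))  # unbiased for sigma^2

print(np.mean(naive), np.mean(corrected))
```

With n = 30 and k = 5 the naive average sits near five-sixths of the true sigma squared, while the n minus k version sits near the truth; the gap shrinks as n grows relative to k, which is why the correction matters most in small samples.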
All right, so now I'm gonna wrap up by talking about some properties of this variance-covariance matrix that we've just constructed. It turns out that the estimates of the variance-covariance matrix that come out of OLS are the most efficient estimates possible with a linear model, which is to say they're the smallest possible accurate estimates of the variability in beta hat. That doesn't mean they're the smallest possible estimates, full stop. For example, I could just say by fiat that my variance is epsilon, bam: I'm gonna estimate with OLS and just declare that my standard errors are all zero, or very close to zero. The problem is that'd be wrong. And I mean a very specific thing by "wrong": the 95% confidence intervals for the beta hat that came out of this estimate would not cover the true beta 95% of the time. In my stupid example here, I suspect they would cover the true beta a very, very small proportion of the time, because the claimed variability is way too narrow. So the OLS estimates of the variance of beta hat are the most efficient, smallest-variance estimates for which coverage does hold. And in fact, there's a way of stating this a little more formally, which is the so-called Gauss-Markov theorem. This is in your Davidson and MacKinnon book if you care to look it up. The Gauss-Markov theorem says that if the expected value of the error term given X, and this is the true error term, not the OLS estimate of the error term, is zero, that is, if U has mean zero for every value of X, and the error term is homoscedastic...
So that's what this means here: if the error term is homoscedastic and the model is properly specified, so that assumption one, in other words, holds, then the OLS estimator beta hat is more efficient, which is to say lower variance, than any other linear unbiased estimator. So, jumping down a bit: OLS is the best linear unbiased estimator under the five assumptions laid out, if those assumptions are correct. Other estimators might be more efficient, they might have tighter variances, but that could only be true if they were nonlinear or biased. We can't do better than OLS with a linear unbiased estimator. That's what Gauss-Markov is telling us. Now, I've listed the assumptions that are necessary to sustain this proof so that you know them. If you wanna see the actual proof, see the book, specifically the Davidson and MacKinnon text, which has a proof of this; it's a little bit involved. What I'd rather spend my time talking about is this bias-variance trade-off that we're presupposing here. Now, I don't wanna talk about this too much, because it's not necessarily one of the core topics of OLS, but it's worth noting that estimators can be unbiased, and they can be efficient, or low variance, and those are different properties. What the Gauss-Markov theorem is telling us is: if we look at all the possible linear estimators for which the bias is zero, okay? I just made up, this is like estimator one, estimator two, and these are linear estimators, linear estimator one, linear estimator two, and OLS. Actually, I should redraw this graph, because I believe it's a little misleading the way I've drawn it. Let me try this again. The reason I think it's misleading is that OLS is an unbiased estimator, so if you were to put OLS on this graph, it's actually gonna be more like a point. Nice. It's gonna be more like a point. Oh wow, apparently that means it thinks I'm erasing it. Let's try this.
Let's just make my pen thicker. Okay, here's OLS. It has zero bias and some variance determined by the formulas I've given you. Now, there might be other linear estimators. So I might have some other linear estimator one and some other linear estimator two. And LE one and LE two are unbiased, because I've drawn them that way, but they're less efficient than OLS: the variance in beta hat for those estimators is larger, larger than it has to be. So what the Gauss-Markov theorem, written appropriately in blue marker here, is telling us is that of all the possible linear estimators that are unbiased, OLS gets you the best, most efficient estimates. That doesn't mean we could not do better on variance, and there are two ways we could. First of all, we could come up with some kind of nonlinear estimator that's unbiased and more efficient. What would that estimator look like? Who knows? The Gauss-Markov theorem is just saying we're not considering that in the class of comparative estimators we're looking at. So it's not that OLS is the best thing under the sun; it's the best linear thing under the sun. And furthermore, it's the best linear unbiased thing under the sun. So maybe we could write down, I'm gonna use the thick pen here, a linear estimator three that is biased but lower variance. And going back to my earlier analogy of shooting at a target, you can think of it a little like this. Here's my target, and I've got a bull's-eye in the center of that target, okay? There's my bull's-eye; I'm gonna move that around so it's kind of in the center. Here's OLS, I've got my blue pen for OLS. OLS is gonna shoot, and it's gonna get an unbiased estimate with some kind of variance that's lower than all the other alternative linear unbiased estimators. This linear estimator that's biased might be very, very efficient, just like that. The problem is it's biased, right?
It hits the wrong target. So we're shooting at the center of the target, which is the true beta. Our blue dots here are OLS estimates of beta hat. The gray dots are some other estimates, biased in some way. And whether this gray cloud is better than this blue cloud is not necessarily an automatic decision. You might say, well, the blue cloud is obviously better, but not necessarily. Suppose, for example, OLS had a very, very wide variance, such that it wasn't really giving you a lot of information in any one particular sample. If OLS had a really wide estimate cloud here, you wouldn't know in any one sample if you had that one or that one; hence, the estimates aren't gonna be terribly useful. In that case, it might be better to accept a little bit of bias and get a lot less variability in your estimates. So the Gauss-Markov theorem does not prove hands down that OLS is the best thing to do. It's just telling you that of all the linear unbiased estimates you can get, it will be the best, and again, emphasizing: if assumptions one through five are true, specifically if the expected value of U given X is zero, if U is homoscedastic, and if the model is properly specified. So that is assumption one, proper specification; two, mean zero; and three, homoscedasticity. Like over here: A1, A2, and A3 are invoked, and that's what's needed to sustain Gauss-Markov. So there's one more thing I wanna show you, and that's how to get the VCV matrix estimate out of R in a matrix sense. Here I've got this example all ginned up for you. What I'm doing here is creating a data matrix of 100 uniform random draws between zero and ten, arranged in two columns, so I'm gonna have two variables, each of length 50.
So this is actually a data set of size 50. And I'm gonna column-bind a column of ones with X, so now I'm gonna have three variables: the constant and then the two X values. I'm gonna write down a beta vector that I just made up: 2.8, 1.3, and 6.5. This is gonna be the constant's beta, this is gonna be the beta for X1, and this is gonna be the beta for X2. And then I'm gonna say that Y is X beta plus a normally distributed random variable with mean zero and standard deviation two. So if I just run this stuff, let me give you a sense of what this looks like. Here's what the final X matrix looks like: three variables, a constant, X1, and X2. There's the Y; it's a vector, n by one. Now what I wanna do is run a regression on this data, and I'm gonna run it longhand with matrices. So here's where I'm running that regression, and you can see I'm calculating beta hat using X transpose X inverse X transpose Y, longhand, just like we do in our lectures. So I'm gonna run that and get estimates for beta. And if I type in beta hat, there you go: my estimates for beta are 2.5, 1.4, and 6.4, which are pretty close to 2.8, 1.3, and 6.5, which is what we expected. Now what I'm gonna do is get the VCV matrix. The first thing I'm gonna do is calculate U hat. U hat is Y minus X beta hat. So there it is: those are the residuals, the gap between Y and X beta hat. And in fact, I could plot this if I wanted; let's do a quick plot. First I'm gonna predict Y hat, I'll call it y.hat, and Y hat is gonna be equal to X times beta hat. Maybe I should put in the right symbol there, the matrix multiply, there we go. And now if I do a plot of Y against Y hat, U hat is just the distance from the Y equals Y hat line, so abline(0, 1) to draw that line. There we go. U hat is just the vertical gap from Y hat to Y for each one of these points.
That's the collection of residuals; that's what we're getting with U hat. Now the VCV matrix is, as we just laid out, U hat transpose U hat times one over n minus k, times X transpose X inverse. So that is this: there's U hat transpose U hat, and then there's the VCV. So if I run that and then type in vcv, whoops, should be lowercase, I got it. Now, how do I know that I did the right thing? How do I know I got the right VCV matrix? Well, there are a couple of ways I could figure that out. One way is to run a canned linear model of Y against X. I think this will work; let's see if it does. Sure it does. So if I do a summary of lm(y ~ x), first of all, I get the same coefficients I got. Oh, it's including an intercept in there, so let me take that intercept out, because I've already got an intercept in X. There we go. So now if I compare that to beta hat, they're the same. That's reassuring; I did the right thing. Now I wanna see if my variance-covariance matrix is the same. There are a couple of ways I could do that. You can see, for example, that the standard errors are reported right here, and if I got my VCV matrix right, these standard errors should be the square roots of the diagonal elements of my VCV. How do you get the diagonal elements? Trace? No, wait a minute, trace is gonna add them up; that's not gonna work. How do I extract the diagonal elements? Let's think about this. Oh, okay, I think I can just say diag of the vcv. I think that will work. So let's try that. Ah, got it. So there are the diagonal elements of the VCV matrix. And now if I take the square root of that, run it, I should get the same standard errors.
So 0.677, 0.106, 0.871: the same standard errors. So now I'm pretty confident that I'm right. And if I really wanna go nuts, I can go further. I can say, all right, let's look at our vcv versus vcov of lm(y ~ x - 1). vcov extracts the variance-covariance matrix out of a linear model object, and you can see that the vcov from the canned linear model routine is the exact same VCV as we calculated manually. So now we can be pretty sure we're right. All right, so I'm gonna cut it off there, and I will see you next week. Hope to see you then.
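As a postscript for anyone who'd rather follow that demo in Python than in R, here is a rough translation. This is my own sketch, not the lecture's code: the use of numpy, the seed, and the variable names are my choices, with the same made-up betas (2.8, 1.3, 6.5) and error standard deviation of two.

```python
import numpy as np

# Sketch (my translation of the R demo): simulate data, estimate beta-hat
# and the VCV longhand, then read off the standard errors.
rng = np.random.default_rng(42)
n = 50
X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=(n, 2))])
beta = np.array([2.8, 1.3, 6.5])
y = X @ beta + rng.normal(0, 2, n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                 # (X'X)^{-1} X'y, longhand
uhat = y - X @ beta_hat                      # residuals
k = X.shape[1]
vcv = (uhat @ uhat) / (n - k) * XtX_inv      # the VCV matrix
se = np.sqrt(np.diag(vcv))                   # standard errors

print(beta_hat)
print(se)
```

Just as in the R session, you can sanity-check the longhand estimates against a canned routine (for instance `numpy.linalg.lstsq`, or statsmodels if you have it installed); the coefficients and standard errors should match.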