Hi, this is Justin Esarey, and this is the third week of POLS 509, the linear model. This week we're going to talk about what an ordinary least squares regression means in a geometric sense. I don't mean how to plot y against x on a scatterplot — that is a very useful skill — I mean how to understand, from a geometric perspective, what OLS regression is doing in a deep sense. One of the immediate ways you'll know we're not doing the typical scatterplot thing is how we're going to interpret variable vectors.

So let me start off by writing down a couple of variable vectors. Suppose we've got two vectors: x1 with two observations, (5, 2), and x2 with two observations, (1, 4). These are variable column vectors: x1 might be our first independent variable of interest, x2 our second. What we're going to do is plot these vectors in observation space. I'm going to pull up a graph and say that the x-axis is observation 1 and the y-axis is observation 2. You're used to seeing graphs where the axes correspond to variables — y on the y-axis, x on the x-axis. In this case the variables themselves are going to be vectors, in a space whose axes are defined by observations.

What do I mean by that? To plot the x1 vector in this space, I mark off 5 on observation 1 and 2 on observation 2, make a dot there, and draw a vector from the origin to that point. That is the x1 variable as represented in observation space: the first observation has a value of 5 for x1, and the second observation has a value of 2. So this is a data set with two observations — a toy data set, in fact. Everything I'm going to say applies to higher-dimensional data spaces, but you'll see a little later in the program, so to speak, why we're sticking with two observations for the time being. Likewise, x2, variable 2, would be at location (1, 4), so I draw a vector out to that point and label it x2. Now I've got two vectors in two-observation space, and each vector corresponds to a variable.

Now suppose I wanted to add these two vectors, which is what I'm getting at down here. You know from a couple of weeks ago how to add vectors numerically: x1 plus x2 means taking the two column vectors and adding them element by element. So 5 plus 1 is 6, and 2 plus 4 is 6, which makes x1 plus x2 equal to (6, 6). Now watch this: if I plot x1 plus x2 on the graph at the point (6, 6) and draw a vector out to it, you'll notice that x1 plus x2 is the sum of the two vectors graphically as well as numerically. What do I mean by that?
If I take x1 — there's its tail, there's its head — and drag it up so that its tail sits at the head of x2, you'll see that the point x1 plus x2 is located exactly where the head of x1 ends up. The same goes if I start at the head of x1 and put the tail of x2 there: either way, I reach the same point. This is all just a way of saying that when you add two vectors, you don't just add them numerically; there's a graphical interpretation too. Specifically, putting two vectors head to tail in a graphical sense is equivalent to adding them in an arithmetic sense. So there's your first fact about adding vectors.

Okay, so that's adding vectors — now what about subtracting them? Subtracting, as it turns out, is conceptually very similar to adding. Instead of putting the vectors head to tail in the direction the second vector points, you put them head to tail in the opposite direction. What do I mean by that? Let's look at x1 minus x2. Now x1 is (5, 2) and x2 is (1, 4), so x1 minus x2 is (4, -2). I've got a negative number in the second dimension, so I'll extend the axis downward a bit and plot x1 and x2 again. To subtract, I take the x2 vector and, instead of attaching it head to tail the way I would to add, I reverse its direction — I put it tail to head, as it were. If I've done this correctly, that should land me at the point (4, -2), and it does. So subtracting vectors is very similar to adding them; you just go in the opposite direction. We actually don't do a great deal of vector subtraction in what we do, but it's nice to see the connection, just in case it ever becomes relevant for you.
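You can check both operations in R, which adds and subtracts vectors element by element. A minimal sketch (my own illustration, not from the lecture files):

```r
# Variables as vectors in two-observation space
x1 <- c(5, 2)
x2 <- c(1, 4)

x1 + x2  # 6  6  -- the head-to-tail sum
x1 - x2  # 4 -2  -- head to tail with x2 reversed
```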
So that, I'm sad to say, was kind of the easy part of what we're going to talk about today. Now let me start getting into material that is more meaningful, for one thing, but also a little more difficult. As we've just seen, vectors have a graphical interpretation. Vectors can also be thought of as spanning a subspace. A subspace of what? Of the space defined by placing real number lines at right angles to each other — k real number lines at right angles form the space R^k.

Here's one you've seen a lot: the Cartesian plane. That is R2, because here's the first real number line, here's the second real number line, and we've put the two together at right angles to form a two-dimensional real space. Similarly, if I label three axes x, y, and z — and you have to imagine one of them shooting out of your screen as a third dimension — that's R3, the space defined by three real number lines, all perpendicular to each other; every angle here is a right angle. And R1 is this friendly guy, the plain real number line, which you've probably seen many times, at least since algebra one. So all we mean is that there's some space created by placing real number lines at right angles to each other, and a collection of vectors spans a subspace of the space those vectors live in, with dimension equal to the number of linearly independent vectors you have.

Let me give you a sense of what I mean. Suppose I have two vectors: I'll call the first one x, equal to (1, 0), and the second one y, equal to (0, 1). This pair of vectors spans a subspace of R2. Here's what it looks like: there's the point (1, 0) and there's the point (0, 1), and I draw vector x out to the first and vector y out to the second — sometimes we'll notate the vectors with little hats. The drawn vectors themselves only go out to (1, 0) and (0, 1), but together they span a vector space. This is the kind of pair you might be familiar with: nice right angles, like you're used to from the Cartesian plotting plane, so it's a friendly pair of vectors. But there are lots of other possibilities, some of which are not so friendly. So let me partition off some space and give you two other vectors that are a little less friendly: x = (3, 2) and y = (1, 4). Now, what would a subspace created by these two vectors look like?
Well, I'm just going to plot them out. I specify an origin, mark off (3, 2), and draw the vector — that's x. Then y is (1, 4), so I draw that one in too. These two vectors also span a subspace. You can see the angle between them is very obviously not a right angle, but this subspace is related to the first, friendlier one. The reason they're related is that it's going to turn out we can represent points in R2 using either coordinate system, and in some sense the two are equivalent to each other. The takeaway is that any collection of vectors you have spans, in principle, a subspace whose dimension is the number of linearly independent vectors.

Now let me skip right down to this point, and we'll come back to what I was about to say. The point I want to make is that any point in the space spanned by the k linearly independent column vectors of an n-by-k matrix X — k is the number of columns, so we've got k column vectors — can be expressed as a combination of the column vectors in X. What do I mean by that? Let's take as an example the vector space we started with, spanned by (1, 0) and (0, 1). I'll create a new graph and draw the x and y vectors in this space: x is (1, 0), y is (0, 1); there's x and there's y. Any point in this space — and this is R2 right here — can be represented as a linear combination of these two vectors. Let me give you an example. Take a random point, (4, 2). I can represent that point as 4x plus 2y — and here I'm putting hats on these vectors so you don't confuse them with variables later. I can do that for any point in this space, going out to infinity or negative infinity in any direction: I just take 4 of the x vector to get out to the 4, and 2 of the y vector to get up to the 2, and that's it.

Now, what's a little tricky is that I can also do this with the crazier vectors, (3, 2) and (1, 4). Without belaboring it too much, let me take a little graph and draw those in: here's (3, 2), and what was the other one?
(1, 4). Okay, so there are my two vectors — this is x and this is y — and my original point, (4, 2), is right there. Now, how could I reach this point using x and y? What I can do is stretch or shrink the x and y vectors to get to that point. What I'm telling you is that with these two vectors I can still reach any point in R2 by stretching and shrinking x and y. The real question is how far I have to stretch or shrink them to get there. It's a little hard to see, although graphically you can get a sense that we're going to stretch x out a little and then tack on a negative version of y — one that reaches downward and is shortened a lot.

It turns out there's a fairly easy way to figure out exactly how much we need to stretch or shrink x and y. So x is (3, 2) and y is (1, 4). Say I'm going to stretch x by an amount a — a multiplier, so I could double it with 2x, halve it with (1/2)x, or reverse it with something like negative x — and stretch or shrink y by a factor b, and I want to land on the point (4, 2). That gives me: a times (3, 2) plus b times (1, 4) equals (4, 2). Now I can write that in a way that's very helpful to me: the matrix whose columns are (3, 2) and (1, 4), times the vector (a, b), equals (4, 2). That is a system of linear equations, like we did in the first week of class. The vector equation I just wrote and this matrix expression are mathematically equivalent — you can see that by doing the matrix multiplication and checking what equations result, but we've already covered that, so I won't belabor it.

What we want to do is solve this equation for (a, b), to figure out how much we need to stretch and shrink x and y. As you might remember from the first and second weeks of class, we pre-multiply both sides by the inverse of the matrix with rows (3, 1) and (2, 4) — any matrix times its inverse equals I, the matrix version of 1 — so we get (a, b) on the left-hand side, and on the right-hand side we get that matrix's inverse times (4, 2). That's a problem one could solve by hand, but it's easier to go over to RStudio and do it. There, I've entered the matrix A — 3, 1 in the first row and 2, 4 in the second, just like on the screen — and a vector I've called xy, equal to (4, 2); that's the point we're trying to reach. If I take A inverse times xy, what I get is (1.4, -0.2).

Going back to my graph, what this means is that to get to the point (4, 2) using x and y, I stretch x by a factor of 1.4 — that becomes my new, stretched-out x vector — and I take y, use 0.2 of it — shrink it down to one-fifth of its size — and point it in the opposite direction. So what I've got is 1.4x plus negative 0.2y, and that gets me to the point. And I can repeat that process for any point in R2.
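Here's what that RStudio calculation looks like written out (a sketch of the computation just described; note that solve() with two arguments does the pre-multiplication by the inverse for you):

```r
A <- matrix(c(3, 2, 1, 4), nrow = 2)  # columns are x = (3, 2) and y = (1, 4)
xy <- c(4, 2)                         # the point we are trying to reach

solve(A) %*% xy  # explicit inverse: a = 1.4, b = -0.2
solve(A, xy)     # same answer, computed more stably
```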
What I'm alluding to here is that this x and y pair — which, you know, sit at some weird acute angle rather than at 90 degrees — can effectively serve as a plotting plane in the same way the friendly (1, 0) and (0, 1) vectors you're used to can. Normally, when we say some point is (4, 2), all we're saying is that it's 4x plus 2y in the friendly coordinates. Well, we can just as easily say it's 1.4x minus 0.2y, if by x and y we mean these weird acute vectors.

All right, now what I want to do is get to the graphical connection to OLS regression. We've set up a lot of facts in abstract mathematical terms; now we're going to pay that off in OLS. What I've written here is that in a linear regression, the columns of X (n by k) — the matrix of independent variables — span a space of dimension k, where k is the number of independent variables you have, including the constant term. That space is a subspace of the n-dimensional space defined by observations.

Okay, here we go. I'm going to start with a real basic space: a three-observation space, with axes for observation 1, observation 2, and observation 3. Now suppose I have two independent variables that I'm measuring — so three data points and two variables — and I put those variables into this observation space: here's one of them, here's the other; call them x1 and x2. You can imagine these two variables jutting out into the three-dimensional space created by the observations. We've got n = 3 observations but only k = 2 variables. What does that mean? It means the subspace created by the independent variables is going to be two-dimensional — that is to say, a plane — and that plane floats in the three-dimensional space created by the observations.

Look at this graph: you can sort of see x1 and x2 forming a plane. Let's see if Professor Esarey can draw — I already know the answer is no, but we'll give it a shot anyway. If I take x2 and replicate it along the axis created by x1, sketching in grid lines in gray, and then move x1 along the axis created by x2, you can see what I've got is a plane floating in three-dimensional space. You can think of it as a piece of paper, like this one, floating in the three-dimensional space we inhabit as part of our lives — except that here the three dimensions aren't height, width, and depth; they're observation 1, observation 2, and observation 3.

That's the three-dimensional version. You might be able to see this a little more easily if we consider two observations. Let me create a two-dimensional space — observation 1 and observation 2 — and say I have just one independent variable; I'll call it x. Now I've got a one-dimensional space floating in a two-dimensional space.
A one-dimensional space is a line and a two-dimensional space is a plane, so I've got a line floating in a plane. That's just a more colloquial way of saying what I said up above: the space spanned by the columns of your independent-variable matrix is a subspace of the space created by the observations.

And this should make clear why we need n to be bigger than k when we run regressions — which is to say, why we need more observations than independent variables. Continuing with the two-observation case: with one variable, that's fine; I've got a nice one-dimensional subspace. With two independent variables, I can now reach any point in observation space using those two variables — which is a way of saying my data set is going to be perfectly determined; I'll be able to predict it perfectly, in a sense. And if I start throwing in even more variables — here's x3 — I now have more variables than observations, and I don't need that third variable to do any prediction in this space. It's superfluous. In fact, I could predict x3 itself using x1 and x2: x3 is no longer linearly independent of x1 and x2. So that's why we need more observations than variables, and it will become even clearer as we proceed with the demonstration and I start showing you what linear regression is doing in this subspace.

All right, so what is OLS doing here? You remember from last week that something is done to the dependent variable y, using the matrix of independent variables, to produce predictions y hat. In particular, y hat equals X beta hat — the prediction of y is X beta hat — and we get beta hat from the formula you should have tattooed on your arm: beta hat equals (X transpose X) inverse X transpose y. That means the predictions y hat are given by X times that beta hat: X (X transpose X) inverse X transpose y. So something is done to the y variable using the X matrix to produce predictions, and that something is defined by this matrix. Just store that for a second.

Furthermore, we know that a prediction plus the estimated error term adds up, by definition, to the observed value of the dependent variable. What that means is that these two vectors, added together, have to form some kind of triangle in vector space. So if I've again got two observations — observation 1 and observation 2 — and here's my dependent variable y: I know that my prediction and my u hat have to add up, in some way, to y. For example, here's a prediction vector, y hat — the variable corresponding to the predictions I generate out of an ordinary least squares regression — and u hat has to be the vector that closes the triangle, so that the two of them add up to y. Furthermore, they have to be at right angles to each other. You'll notice that I intuitively drew them at a right angle.
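Incidentally, the tattooed formula is easy to verify numerically. Here's a quick sketch (made-up data of my own, not part of the lecture files) comparing the matrix formula to what lm() reports:

```r
set.seed(509)                       # arbitrary seed
n <- 50
x <- rnorm(n)
X <- cbind(1, x)                    # constant term plus one regressor
y <- 2 + 3 * x + rnorm(n)

beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y  # (X'X)^{-1} X'y
cbind(beta_hat, coef(lm(y ~ x)))              # the two columns match
```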
Why do they have to be at right angles to each other? Well, the reason is that if they were not at right angles, u hat would not be as short as it could be, and one of the key facts from last week's lecture is that what OLS does is pick beta hat — that is to say, pick y hat — so that the estimated error terms are as small as they can possibly be. Imagine, for example, that y hat were actually out here, somewhere crazy; call it y hat prime. To add up to y, that would imply a u hat prime like this, and you can see it's really, really long. So, in essence, why do y hat and u hat need to be at right angles to each other? Because if y hat and u hat are not perpendicular, then the length of u hat is not minimized. But we know that OLS determines the beta hat, and hence the y hat, that results in the minimum-length u hat. So those things have to be at right angles; otherwise we have a problem. Incidentally, there's more to say about lengths here, but we'll get back to that later.

So: linear regression represents a projection of the y vector onto the space defined by X (n by k). What do I mean by that? Let me go back up to the diagram, copy it down here, and get rid of the extraneous extra vectors. Okay, so here's y hat, u hat, and y. Now, y hat equals X times beta hat — we know that — so y hat is a stretching or shrinking of the x vector. In this case, with only one x variable in a two-dimensional space, x spans a line, and beta hat is what stretches or shrinks vectors along that line so as to minimize u hat.

Let's stop right there for a second, because what I want to talk a little more about is what we mean by projection. There are lots of different ways of thinking about this, but I'm going to start with a kind of intuitive one and then get to a slightly more technical one. Here's my diagram, and imagine that I've got some kind of light source — here's the sun shining down, projecting light onto y. In this scenario, x is kind of like the ground, and y is kind of like a pole shooting up out of the ground into the sky. Then y hat is like the shadow that the y pole casts onto the ground formed by x. That's why it's sometimes called a projection: it is literally the projection, by this imaginary light source, of y onto x. And u hat is the distance from the tip of the shadow up to the end of the pole floating in space — which is another way of saying it's the part of y that can't be represented by a vector lying in the space defined by x. That is what ordinary least squares regression is doing: creating this projection.

A more technical way of thinking about it is to say there's a part of y, the dependent variable, that can be explained by x, the independent variable, and a part of it that can't be.
What ordinary least squares regression does is get as close to the tip of y as it can while moving only within the linear space defined by x. In this case, that point is right about here: that's as close as I can get to the tip of y just by moving linearly in x space. It doesn't take me all the way to y — there's an additional distance to travel, u hat — but it's as close as I can possibly get. So X beta hat is the point in the x subspace that is closest to the tip of y in the larger space. Notice that x here spans a one-dimensional subspace, a line, while the observation space is larger, R2. X beta hat is the point on that R1 line that's as close as I can get to the y vector sitting out in R2 space.

It might also be helpful to see this mathematically. As I mentioned before, y hat equals X (X transpose X) inverse X transpose y. This piece right here — X (X transpose X) inverse X transpose — is a matrix called P_x, the projection matrix for the space defined by X. What dimension is P_x? Well, X is n by k; X transpose is k by n; the inverse in the middle, (X transpose X) inverse, is k by k; so everything conforms and the whole product, P_x, is n by n. What it does is take a vector like y as input and project that vector onto the space defined by X. It tells you how much of any input vector can be represented in x space — how close you can get to that vector while staying in x space. So another way of writing the equation is y hat equals P_x y. The two statements mean the same thing; P_x is just a shorthand for the longer matrix expression.
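To make P_x concrete, here's a small sketch (again with invented data) that builds the projection matrix explicitly and checks that P_x y reproduces the fitted values from lm():

```r
set.seed(509)
n <- 50
x <- rnorm(n)
X <- cbind(1, x)
y <- 2 + 3 * x + rnorm(n)

P <- X %*% solve(t(X) %*% X) %*% t(X)  # the n-by-n projection matrix for X
y_hat <- P %*% y                       # project y onto the space of X

all.equal(as.numeric(y_hat), unname(fitted(lm(y ~ x))))  # TRUE
```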
All right, what this leads us to is our next fact: orthogonality. X beta hat and u hat are orthogonal. By that word, orthogonal, we mean a bunch of different things, all of which are equivalent to each other. The first thing we mean is that the angle between X beta hat and u hat is 90 degrees, and I hopefully convinced you of that above, with the little diagram demonstrating that if it were not so, the length of u hat would be longer than it needs to be — and therefore u hat would not be minimized, which, as we showed last week, is what ordinary least squares does.

There's actually a little proof of that I can show you pretty quickly. Take this diagram: we've got some y, some X beta hat, and some u hat, and I drop a perpendicular from the point where y and u hat meet down to the X beta hat line, forming a 90-degree angle. Say that perpendicular has length a, and let j be the distance along the x line out to the foot of the perpendicular. Then, by the Pythagorean theorem, the squared length of u hat equals a squared plus the quantity (length of X beta hat minus j) squared. All I'm doing is the usual Pythagorean-theorem deal: the opposite side squared plus the adjacent side squared equals the hypotenuse squared, which works as long as we have a right angle at the foot of the perpendicular, which we do. So that's the squared length of u hat, and the length itself is the square root of a squared plus (length of X beta hat minus j) squared.

Now, what value for the length of X beta hat minimizes the length of u hat? I think, just trivially, without doing any calculus or anything, it has to be the value that makes the length of X beta hat equal to j. How do we know that? The term is squared, so negative values aren't going to help us — they won't shorten this vector — and the smallest we can make the squared term is zero. That happens exactly when j equals the length of X beta hat — that is to say, when u hat sits at a right angle to X beta hat. So that's a real simple proof, or proof-like thing, of the first point, using the Pythagorean theorem, in case you weren't convinced before.

That statement is equivalent to the statement that u hat cannot be represented by vectors in x, and that point comes out of the following demonstration. Remember that P_x is what projects a vector y onto the x space. But by definition, u hat is at right angles to X beta hat. So trying to project u hat into the space defined by X is like asking what kind of shadow a pole facing straight up casts at noon. Imagine our little space here: I've got a y, an X beta hat, and a u hat, and the sun is right about here, directly overhead. We're asking what part of u hat lies in x space — how close can I get to the tip of u hat while staying in x space? The answer is: I can't get close at all. Imagine sliding u hat down so its tail is at the origin. How could I move along the x line to get closer to the tip of u hat? I couldn't get any closer than the origin itself — that is to say, no closer than beta hat equals zero. If I choose a negative beta I move out this way and get further away; if I choose a positive beta I move out the other way and get further away. The closest I can get is zero. So u hat cannot be represented by vectors in x, and incidentally we've also just shown fact four: a regression of u hat on x is going to give you a beta of zero. Equivalent to all of these statements is that X (X transpose X) inverse X transpose u hat equals zero — or, alternatively, P_x u hat equals zero. There is no projection of u hat onto x.

Now, one thing I want to emphasize is that these are all properties of estimates, not of the data-generating process in the real world. We're not talking about u and X beta here. It is not necessarily true that if there's a real data-generating process that's linear, X beta, with a real error term u, those two things will be orthogonal. What I'm telling you is that if you run a regression and fit y hat and u hat, then a regression of u hat on y hat will give you zero correlation by definition, no matter what the data-generating process is. Period. That's my point.
So if I go over into RStudio, pick an x and a y, and run a linear model of y on x, I can generate predictions y hat — I'm going to call those model.pred — and then get the residuals from that model, which is just y minus y hat, or y minus model.pred. Then I try to predict the residuals with the fitted y hat: I basically run a regression — the scatterplot version of a regression — of the residuals on the predictions. There it is: look what happens. I get what amounts to zero correlation, no correlation at all, which makes perfect sense given what I just said. And if I run a formal regression, you can see the coefficient is as close to zero as you can get without it being exactly zero in machine arithmetic — something like negative 3.2 times 10 to the negative 16th. That's zero.

One last point about this: everything I've just said holds in higher than two-dimensional space. All my diagrams have been two-dimensional because they're easier to draw, but all four of these facts — that X beta hat and u hat are at 90 degrees, and so on — hold no matter how large the data set gets. In a thousand-observation data set, all these things are just as true as in a two-observation data set. To give you a real quick sense of what that looks like, take a three-dimensional space: observation 1, observation 2, observation 3. Imagine I've got two vectors — for ease of exposition I'll lay them in a fairly flat part of the space — x1 here and x2 there, and imagine y kind of shooting upward, out of the plane they define. If I try to represent y in the space defined by x1 and x2, visually it would look something like this: there's X beta hat — that is, y hat — lying in the plane, and a u hat connecting it up to y, with the two at right angles. What I've done is ask how close I can get to y on the plane defined by x1 and x2. And if I ran the residuals from this regression against y hat, I would get zero correlation, just as I did in the two-dimensional case. There's no trick in the two-dimensional case; it always works, and I hope you saw that from the RStudio example. Going back to RStudio: this data set has a hundred observations in it — y equals 2x plus 3 plus a normally distributed error centered on zero with standard deviation 1 — so that's a hundred-dimensional observation space. And even in that hundred-dimensional space, when I run a linear model predicting u hat with y hat, I get nothing.
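The script behind that demonstration looks something like this (a reconstruction of what's on screen; the variable name model.pred and the data-generating process follow the lecture, the seed is mine):

```r
set.seed(1)                                    # arbitrary seed
x <- rnorm(100)
y <- 2 * x + 3 + rnorm(100, mean = 0, sd = 1)  # the DGP described in the demo

model <- lm(y ~ x)
model.pred <- fitted(model)                    # y hat
resids <- y - model.pred                       # u hat

plot(model.pred, resids)                       # a flat, patternless cloud
coef(lm(resids ~ model.pred))                  # slope on the order of 1e-16, i.e. zero
```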
Now, the reason I'm telling you this is — well, several reasons. You may have already heard some of them; if not, you'll hear them again. You may have heard at some point — and if you haven't yet, you eventually will — that it can be a problem for the error term u to be correlated with your regressors X. There are lots of cases where that might be a problem, and some people's initial reaction is to say: oh, well, I should check for that. So I'll run the regression, y equals X beta, then estimate the residuals out of it — y minus y hat equals u hat — and now I'll see whether X predicts u hat. The answer, no matter how good or bad your regression is, will always be no. And that's not true because of any property of the world; it's a property of OLS. OLS is designed to work that way. That's what you will get if you do that, every time, so you shouldn't use it as a diagnostic tool.

Okay, so let's talk a little about variance decomposition and the sum of squares. Recall that X (X transpose X) inverse X transpose equals the projection matrix P_x, which projects any n-by-1 vector into the space defined by X. There is an equivalent residual-maker matrix, M_x. Let's talk about that a little. M_x is I — the n-by-n identity matrix — minus P_x, which we've already established is n by n. What the residual matrix does is give us the residuals: if, for example, I take a vector y and pre-multiply it by M_x, what I get is u hat. So M_x y plus P_x y equals y — the residuals from y plus the fitted values for y, that is, u hat plus y hat, have to add up to y, for reasons we've already established.

And as I just mentioned, the Pythagorean theorem says that the length of X beta hat squared plus the length of u hat squared equals the length of y squared. What is this notation — the double set of bars with something inside? It's called the Euclidean norm, and it's just a way of denoting the length of a vector: the Euclidean norm of a vector x is the square root of the sum of its squared components. That, too, is implied by the Pythagorean theorem: take every component of x, square it, add them up, and take the square root, and you get the length of x. Imagine one component of x along one axis and another component along a second axis: a squared plus b squared equals c squared, the total length. There's no real trick there. And y transpose y, as we learned in the first and second weeks of class, is the sum of squares of the vector y — so the squared Euclidean norm of y is just y transpose y. That's going to equal (X beta hat) transpose (X beta hat) plus u hat transpose u hat. The left-hand side is sometimes called the total sum of squares, the first piece on the right is the explained sum of squares, and the second piece is the residual sum of squares.

What we mean by this is that there's some degree of variation in y that can be explained with X, and that variation is representable by a sum of squares. In fact, if y has a mean of zero, then the variance of y is the sum from i = 1 to n of (y_i minus mu_y) squared — divided by n, strictly speaking — and when mu_y equals zero, the numerator is just the sum of the y_i squared, which is y transpose y. So in the special case where you've transformed y to have mean zero, the total sum of squares is (n times) the variance of y.
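Here's the decomposition in miniature (same kind of invented data as before; these are the uncentered sums of squares, matching the mean-zero discussion above):

```r
set.seed(509)
x <- rnorm(40)
X <- cbind(1, x)
y <- 1 + 2 * x + rnorm(40)

P <- X %*% solve(t(X) %*% X) %*% t(X)  # projection matrix
M <- diag(40) - P                      # residual-maker matrix

all.equal(as.numeric(M %*% y), unname(resid(lm(y ~ x))))  # TRUE

TSS <- sum(y^2)                        # ||y||^2
ESS <- sum((P %*% y)^2)                # ||X beta hat||^2
RSS <- sum((M %*% y)^2)                # ||u hat||^2
all.equal(TSS, ESS + RSS)              # TRUE: the Pythagorean decomposition
```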
So the ESS is the part of the variance of y — the degree of variability in y — that can be explained by movement in the x plane, and the RSS is what remains: the part of the variance, or variability, in y that can't be explained by x. This identity, TSS equals ESS plus RSS, is often summarized in the R-squared statistic, which you've probably heard about in your previous classes. R-squared is the explained sum of squares divided by the total sum of squares — or, colloquially, the proportion of variation in the dependent variable y that is explained by x. And you'll notice that R-squared has to lie between zero and one. That's true because these are all non-negative quantities: u hat transpose u hat is a sum of squares, the ESS is a sum of squares, and they add up to the TSS, another sum of squares. That implies the ESS cannot be any bigger than the TSS — if it were, u hat transpose u hat would have to be negative, and that's not possible. So the largest that ESS divided by TSS can get is one. And the smallest it can get is zero, for the same reason: a sum of squares cannot be less than zero. So we've got a nice summary statistic telling us, in short, how much of the variability in y we can explain with a linear projection onto x.

All right. Now what I'm going to do is talk about some facts about projection matrices. Some of these will become more important as we push further into our discussion of OLS; for now they're sort of interesting curiosities which — you'll have to trust me — will become important later. The first is that P_x, the projection matrix for x, is idempotent. Idempotent is a big word that means a matrix that, when multiplied by itself, gives you itself back: P_x P_x equals P_x. Why is this true, intuitively? Let's think about it for a second. Here's a little diagram — observation 1, observation 2 — with some variable x and some variable y. If I project y onto x, I get something like this: that is the projection, P_x y, and it leaves me the residual vector, M_x y. Now, what would happen if I tried to take P_x y and project it onto x? This is a little bit like asking what shadow a shadow would cast — a little Zen koan to go along with your ordinary least squares regression. The answer is that it would just cast itself, because it's already the component of y that lies in x. Any time you multiply by a P_x matrix, you're locking yourself into the space defined by x, and any vector that's already in that space is, by definition, representable by itself in x. That's the intuitive explanation, as best I can give it, but you'll also formally prove it for homework.

M_x is idempotent as well, and that fact is true intuitively for a similar reason. Remember that M_x gives you the portion of some variable y that can't be projected onto x — here's M_x y in the diagram.
So M_x M_x y — this one is actually a little easier to explain intuitively — is asking: what component of y is not representable in x? And then: what portion of that is not representable in x? The answer has to be all of it. Once you've multiplied by M_x, doing it again just gives you the same thing back. But if you don't believe me, we're going to prove that for homework as well, so perhaps you'll get it that way.

Next: M_x P_x is not just a bad punk band, it's also zero. M_x P_x equals zero, which equals P_x M_x. Why is that fact true, intuitively? Well, consider again the meanings of P_x and M_x above. What you're asking is: take P_x y, the portion of y representable in x; then M_x P_x y asks what portion of that is not representable in x. The answer has to be none of it, because P_x y is exactly the part of y that is representable in x — none of that part is not representable in x; that's what we constructed it to be. So M_x P_x y has to give you back nothing: you've asked, in essence, for something that by construction doesn't exist. We're going to prove that fact for homework too.

Now, I've given you a bunch of proofs for homework, so it would be nice if I showed you how one of them proceeds. Here's another fact: P_x transpose equals P_x, and M_x transpose equals M_x. I'm going to let you do one of these for homework — the M_x one — but the P_x one I'll do for you right now. Start with the claim that P_x equals P_x transpose. Writing the right-hand side out longhand, it's the transpose of the whole product X (X transpose X) inverse X transpose. This is just going to be a bunch of applications of the rules of matrix algebra I taught you a couple of weeks ago: I'll start with the right-hand side and try to make it look like the left-hand side using those rules.

This is a product of matrices, transposed. Call X (X transpose X) inverse "A" and X transpose "B"; we know that (AB) transpose equals B transpose A transpose. That lets me write the transpose of X transpose — which is just X again, since the transpose of a transpose is the matrix itself — times the transpose of X (X transpose X) inverse. So far so good. Now I do the same thing again: within that remaining transpose, call X "A" and (X transpose X) inverse "B", and apply the same rule. What comes out is X times the transpose of (X transpose X) inverse, times X transpose. Now, we learned a couple of weeks ago that the transpose of an inverse equals the inverse of the transpose — it's a rule of matrix operations — and I'll apply it to swap those two superscripts: X times the inverse of (X transpose X) transpose, times X transpose. And what is (X transpose X) transpose? By the product rule again, it's just X transpose X. So we're left with X (X transpose X) inverse X transpose, which is P_x. Proof complete.
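All four of these facts are easy to spot-check numerically before you prove them. A quick sketch (floating point makes exact comparison fail, so all.equal is the right tool):

```r
set.seed(509)
X <- cbind(1, rnorm(25))
P <- X %*% solve(t(X) %*% X) %*% t(X)
M <- diag(25) - P

all.equal(P %*% P, P)                  # idempotent: a shadow of a shadow
all.equal(M %*% M, M)                  # idempotent as well
all.equal(M %*% P, matrix(0, 25, 25))  # M_x P_x = 0
all.equal(t(P), P)                     # symmetric: P_x' = P_x
```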
So that's what a proof looks like — one little example of it. You should be able to play around with the rules of matrix operations to prove all of these things, and none of them are especially deep. There's no magic move; none of these are Nash-equilibrium-type proofs. They're fairly simple proofs you can get just by doing what I did: put one thing on one side of an equation and the other thing on the other side, and mess around with them until they look the same. That's pretty much a winning strategy for all of these proofs.

All right. Finally, consider X equals (X1, X2). This is what's called a partitioned matrix. X is a collection of column vectors — x1, x2, x3, and so on up to xk. What I've done is collect some of those column vectors into one matrix, call it X sub 1, and the rest into another matrix, call it X sub 2, so that X1 is n by k1 and X2 is n by k2. I've just broken the matrix into two sets of column vectors. What we can show is a fact about P_1 and P_x, where P_1 is the projection matrix for X1 — for that first block — and P_x is the projection matrix for the entire matrix X. What we want to show is that P_1 P_x equals P_x P_1 equals P_1. I'll start with the first equality.

What we're basically saying is that if we first take the projection onto the entire space of X, and then take the projection of that projection onto the space of X1, what we have left is just the projection onto X1. Let me try to give you a visual explanation of this, and then a formal mathematical proof. I'll take a three-dimensional space — observation 1, observation 2, observation 3; get your head in the game, Justin — with x1 here, x2 there, and y jutting out into the space. If we take the projection of y onto X — that is, the first thing we do is take P_x y — we get some vector down in the plane: the shadow cast by y onto the plane defined by x1 and x2. Call that P_x y. Now take P_x y and project it, let's say, onto x1 alone. What do we get then? I want the portion of that vector that's representable on the x1 line — something like this. In other words, I get the same thing I would have gotten if I had just directly projected y onto x1 alone. It's a little hard to talk about intuitively, but insofar as we can: when we look at the portion of y representable in the entire x space, and then take the portion of that representable on just the x1 part of the space, we might as well just ask what part of y is representable on the x1 space directly. And the same thing is true if we go in the other direction — actually, let me just say it like this: it doesn't matter which one we project onto first. We can do P_1 P_x or P_x P_1.
Either way, we end up with just the portion of y representable on x1. So that's the informal, talking-it-through idea. It's a little bit rough, though, so let's try another tactic and look at a formal proof. What I want to show is that P_x P_1 equals P_1 P_x, and that all of it equals P_1. I'll do the same thing I did last time: write out the definitions and see if I can make the expressions equal each other.

I'm going to start with P_x P_1. P_1, written longhand, is X1 (X1 transpose X1) inverse X1 transpose; so P_x P_1 is P_x times that. Now, what is P_x X1? It's the portion of X1 that's representable in x space — and what portion of X1 is representable in x space? All of it, since X1 is part of X. So the product just becomes X1 (X1 transpose X1) inverse X1 transpose, which is P_1. That shows P_x P_1 equals P_1.

Now I can do the same kind of thing for the other order. Writing P_1 P_x out: X1 (X1 transpose X1) inverse X1 transpose P_x. What I want to know is: what is X1 transpose P_x? By the rules of transposition — (AB) transpose equals B transpose A transpose — X1 transpose P_x is the transpose of the quantity P_x transpose X1. I don't think I've done anything wrong there. And earlier we noted the theorem that P_x transpose equals P_x, so we can rewrite that as the transpose of P_x X1. What's P_x X1? We already decided, up above: it's X1. So X1 transpose P_x equals X1 transpose. Substituting that back in, we get X1 (X1 transpose X1) inverse X1 transpose, which equals P_1. Proof complete. So it's not really that complicated to show formally that both of these products equal P_1, but I hope you also have at least a little bit of an intuitive sense of it. And in case you don't, you'll at least get to formally prove a proposition like this one for homework, in a way similar to the one I just demonstrated.
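Numerically, the partitioned-projection fact looks like this (a sketch with arbitrary simulated blocks of my own):

```r
set.seed(509)
X1 <- cbind(1, rnorm(30))          # first block, n by k1
X2 <- matrix(rnorm(30), ncol = 1)  # second block, n by k2
X  <- cbind(X1, X2)

proj <- function(A) A %*% solve(t(A) %*% A) %*% t(A)
P1 <- proj(X1)
Px <- proj(X)

all.equal(Px %*% P1, P1)  # TRUE
all.equal(P1 %*% Px, P1)  # TRUE: order doesn't matter
```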
Okay, so why did I go through all that? Well, you know I like it, but that's not really the real reason. It allows us to state a theorem that turns out to be extremely important to ordinary least squares regression, and that is implicit in many of the things we do quasi-unthinkingly: the Frisch-Waugh-Lovell theorem. The Frisch-Waugh-Lovell theorem says the following. Consider two regressions. The first is the usual regression, y equals X beta hat plus u hat, with X partitioned into X1 and X2. The second is an alternative regression: M_1 y equals M_1 X2 beta-2 hat plus residuals. What I'm saying with the second regression is: take the components of the data set and run y against X1, getting the residuals out of that regression; then regress X2 on X1 and get the residuals out of that regression; then model the residuals from the first regression against the residuals from the second regression, getting another whole model.

The Frisch-Waugh-Lovell theorem says the estimates of beta 2 from that residuals-on-residuals regression are the same as the estimates of beta 2 in the original regression. What did I just do there? What am I saying? Here's what I'm saying: if I've got a whole bunch of variables, I can run one gigantic grand regression with all of them, or I can run a sequence of smaller, residualized regressions, and either way I'm going to get the same results. Now, you might be saying to yourself: there's this thing called spurious correlation, and I don't want to just run y on a single x1 variable, because there might be some other variable, x2, correlated with x1, that messes up my estimate of how x1 affects y. Well, the Frisch-Waugh-Lovell theorem says: one thing you could do is run a regression of y on x1 and x2 together. Another thing you could do is run a regression of y on x1 and keep the residuals; run a regression of x2 on x1 and keep the residuals; and then fit a third model of the first set of residuals on the second. That regression gives you an estimate of the effect of x2 on y that is the same as if you'd run the giant regression.

All right, that's my best attempt to talk it through. I think there's one application coming that will make it crystal clear how this really works. What I want to do first is a formal proof of the FWL theorem, just to show you that mathematically what I'm saying is correct, and then we'll move on to a substantive, more informal demonstration in RStudio.

So here's the Frisch-Waugh-Lovell setup. I'm going to run a regression of M_1 y on M_1 X2 — regression 2 in the statement above. Recall the formula you should have tattooed on your arm by this point: (X transpose X) inverse X transpose y gives you beta hat. So beta-2 hat from the residualized regression is the quantity (M_1 X2) transpose (M_1 X2), all inverse, times (M_1 X2) transpose M_1 y. What did I just do? The role of X is played by M_1 X2. M_1 X2 is the residuals of a regression predicting X2 using X1 — in R this would be like lm(x2 ~ x1), then taking the residuals to get M_1 x2. That bunch of residuals is going to be our x variable. And the role of y is played by M_1 y, the residuals of a regression of y on X1 — run lm(y ~ x1) and take its residuals. Plug both into (X transpose X) inverse X transpose y, and that's beta-2 hat.

Now I can simplify this a little. (M_1 X2) transpose equals X2 transpose M_1 transpose, and likewise inside the other factor: X2 transpose M_1 transpose M_1 y. M_1 transpose equals M_1 — the symmetry fact from earlier in today's lecture — and, as we also learned today, projection matrices and residual matrices are idempotent: M_1 M_1 equals M_1. So the whole thing collapses to: beta-2 hat equals (X2 transpose M_1 X2) inverse X2 transpose M_1 y. That is our estimate of beta 2 from the second regression.
Now what I want to show is that the estimate of beta 2 I get from running this crazy residuals regression is the same as the beta 2 estimate I would get if I just ran, you know, a normal-person regression: lm(y ~ x) with the whole of x. If I ran that normal-person regression, I'd be doing the following: y = Px y + Mx y, where Px y is y hat and Mx y is u hat. That's just a simple identity. And I can partition the fitted part out as y = x1 beta 1 hat + x2 beta 2 hat + Mx y. These are actually partitioned matrices: x1 is n by k1 and x2 is n by k2, so I've broken x up into two blocks. Now I premultiply the whole equation by x2' M1, which gives x2' M1 y = x2' M1 x1 beta 1 hat + x2' M1 x2 beta 2 hat + x2' M1 Mx y. Now I can start to simplify things. In the last term I can use a theorem which we just proved a second ago, or part of whose proof, actually, I'm leaving to you in the homework: M1 Mx simplifies to Mx, so that term becomes x2' Mx y. The first term on the right I can simplify to zero, because M1 x1 is the portion of x1 not representable on x1, which is to say none of it; that term just dies. So what I have left is x2' M1 y = x2' M1 x2 beta 2 hat + x2' Mx y. Now I can simplify that last term a little further: x2' Mx is (Mx x2)'. And what is Mx x2? It's the portion of x2 not representable on x. Well, that has to be none of it, right? So Mx x2 = 0, the whole term equals zero, and I can just kill it. Now I'm down to x2' M1 y = x2' M1 x2 beta 2 hat. Finally, I premultiply by the inverse of x2' M1 x2, the inverse, not the thing itself, which is a very important difference, and I get (x2' M1 x2)^(-1) x2' M1 y = beta 2 hat. Well, what is this? You're probably asking yourself that right now. That is the exact formula we derived for beta 2 hat from the residual regression. That and that are the same. Proof complete. I can run this crazy residual regression and get the same answer I would get from a normal person's put-all-the-variables-in style of regression, which is an interesting fact in and of itself, I guess, if you're a nerd. But it turns out that it has very important implications for, I'm not going to say non-nerds, but lesser nerds who are more interested in data analysis.
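For reference, here is that second half of the proof condensed into one display, in the same notation as before:

```latex
\begin{aligned}
y &= x_1\hat{\beta}_1 + x_2\hat{\beta}_2 + M_x y\\
x_2' M_1 y
  &= x_2' M_1 x_1\,\hat{\beta}_1 + x_2' M_1 x_2\,\hat{\beta}_2 + x_2' M_x y
     &&\text{(premultiply by $x_2' M_1$; use $M_1 M_x = M_x$)}\\
  &= x_2' M_1 x_2\,\hat{\beta}_2
     &&\text{($M_1 x_1 = 0$ and $x_2' M_x = (M_x x_2)' = 0$)}\\
\hat{\beta}_2 &= \left(x_2' M_1 x_2\right)^{-1} x_2' M_1 y
\end{aligned}
```

which matches the residual-regression estimator derived above.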
Okay, so I've just promised you that all that proof was actually getting us somewhere, and now I've got to deliver on that. Why is the FWL theorem important? Well, it's important for a couple of reasons, and one of them is also an interesting way of demonstrating a consequence of the theorem. So point one I've got up here: using a consequence of the FWL theorem, we can create scatter plots that correct for spurious correlation. Everybody likes scatter plots, right? They're very visual, everyone can see what they are and what they're doing, it's real nice. The problem, or one of the problems, a scatter plot has is that it doesn't have any way of correcting for spurious correlation if it exists, and so it can be misleading. This is one of the first lessons you probably learned in your lower-level undergraduate statistics course. But via the amazing properties of the FWL theorem, we can construct a scatter plot that's both visually compelling and honest. So what I'm going to do here is take you through an example. I'm going to create some variables, starting in my R script. First I clear out what I've got, using the rm command to empty the memory; you just saw it all disappear. And I'm going to use a library called mvtnorm, which allows you to sample variables out of the multivariate normal distribution. That's relevant here because I want to sample variables that are correlated with each other. So I'm going to draw x out of the multivariate normal distribution: 200 samples (that's the 200 here), mean zero, and a variance-covariance matrix, which we haven't talked about yet, though you may have heard about it from Drew, with the correlation set to 0.6. That's the sigma matrix with 0.6 in the off-diagonal. If I take this draw and plot the first column of x against the second column, you'll see these two variables are indeed correlated. In fact, if I ask for the correlation of x1 and x2, oh, look at that, it's about 0.6, just like we said. Okay. Now I'm going to bind a constant column onto x, so if I take a head of x, the first few rows, you'll see x1, x2, and then the constant vector, which is the constant term in a regression. Next I'm going to invent some betas, just some random betas I drew out of nowhere, my head, and create a dependent variable that's x beta plus a normally distributed error term with mean 0 and standard deviation 1. All right. So if I run a regression of y on the x matrix, what I should get is something like negative 3 times x1, plus 1 times x2, plus a half on x3, the constant, and, roughly speaking, that's what I get. This is the constant term x3, this is the second column of the matrix I drew, this is the first column. Everything's great. All right. So I'm going to take that data and just plot the second variable, x2, against y (actually, I think I need to expand my plot a little bit here), and then put in a fitted regression line for that scatter plot. And I'm going to compare that to the known true regression line I built into this fake data set. I know the real regression line should have a slope of 1, because I said it should be 1; I made up this data; I know it's 1. And when I plot a line with a slope of 1 in here, you can see the scatter plot's fitted line goes in the opposite direction. We have a problem. We have a scatter plot that's telling us x2 is negatively related to y when, in actual fact, it's positively related to y. That could be a problem.
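The transcript doesn't show the script itself, so here is a minimal sketch of the sort of code being described. The seed and variable names are my assumptions; only the structure, mvtnorm draws correlated at 0.6 and betas of roughly -3, 1, and 0.5, comes from the lecture:

```r
## Sketch of the demo: correlated regressors and a misleading raw scatter plot.
rm(list = ls())                                # clear everything in memory
library(mvtnorm)                               # for rmvnorm()
set.seed(509)                                  # assumption: any seed will do

Sigma <- matrix(c(1, 0.6, 0.6, 1), 2, 2)       # x1 and x2 correlated at 0.6
x <- rmvnorm(200, mean = c(0, 0), sigma = Sigma)
cor(x[, 1], x[, 2])                            # roughly 0.6, as advertised

x    <- cbind(x, 1)                            # bind on the constant column
beta <- c(-3, 1, 0.5)                          # the "invented" betas
y    <- drop(x %*% beta + rnorm(200, 0, 1))    # y = x beta + N(0, 1) error

lm(y ~ x[, 1] + x[, 2])                        # recovers roughly -3, 1, 0.5

plot(x[, 2], y)                                # raw scatter plot of x2 vs y
abline(lm(y ~ x[, 2]), lty = 2)                # fitted line slopes downward...
abline(a = 0.5, b = 1)                         # ...but the true slope is +1
```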
And this is the short version of why scatter plots are not such a good diagnostic tool in cases where you have strongly correlated independent variables with effects that go in different directions. You'll see in my beta vector here that x1 has a negative impact on y and x2 has a positive impact on y. What's happening is that because I'm not modeling the effect of x1, x2 is sucking up that negative correlation with x1, and bam, it's giving me this crazy line that goes in the wrong direction. So you might conclude: ah, scatter plots look good, but they're misleading. Well, maybe not. We can create an FWL-corrected scatter plot. So what I'm going to come in here and do is this. I've already run the regression of y on the full x, so now I extract all the rows of x and just columns one and three, and call the result z. If I take a head of z, you'll see I've got x1 and the constant; x2 is no longer in this data set. Then I model y using x1 and the constant. Bam, I've just done that right there: this right here is a regression of y on x1 and the constant, and not x2. Then I take the fitted values out of that regression and subtract them from the observed values of y, getting the residuals. So now I've got y.res, the residuals of that regression. Then I do the same exact process, except now I model x2 with x1 and the constant; there's my linear model right there doing that; and I take the residuals from that regression to get x.res. Oh, I've got to actually create those x residuals, don't I? Bam. Okay. Then I run a model of the y residuals I just created against the x residuals. Now, what did I do? What have I got? Well, here's the beta estimate, 1.07, from regressing the y residuals on the x residuals. Notice that it matches exactly the beta for x2 that I got out of my grand regression. The FWL theorem says that is not an accident; the fact that that beta comes out the same is by no means a coincidence. Now I can use that information to create a really cool scatter plot. So, okay, watch this. I'm going to plot the y residuals from that crazy process against the x residuals from that crazy process, and that's going to give me what I call an FWL-corrected scatter plot. The horizontal axis here is x2 net of the influence of x1 and the constant; the vertical axis is y net of the influence of x1 and the constant. If I put in a line corresponding to the beta coefficient I get out of regressing these two against each other, it very closely matches the true regression line I built into this fake data set. In other words, I know the right beta, and this is very close to it. So what this FWL process really enables you to do is create scatter plots that are not misleading in the way a raw scatter plot might be misleading in the presence of spurious correlation. This picture gives an analyst or a reader an accurate idea of how x2 and y move together net of the influence of spurious correlates that might be messing up the process.
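Continuing the sketch above, the FWL-corrected plot step might look like this; again, names like y.res and x.res are placeholders echoing the lecture, not its actual script:

```r
## FWL-corrected scatter plot: net x1 and the constant out of both y and x2.
z <- x[, c(1, 3)]                          # keep x1 and the constant; drop x2

y.res <- residuals(lm(y ~ z - 1))          # y net of x1 and the constant
x.res <- residuals(lm(x[, 2] ~ z - 1))     # x2 net of x1 and the constant

fwl <- lm(y.res ~ x.res)
coef(fwl)                                  # slope matches beta2 from the grand regression

plot(x.res, y.res)                         # the FWL-corrected scatter plot
abline(fwl)                                # the line now slopes the right way, near 1
```

(The `- 1` drops R's automatic intercept, since z already carries the constant column.)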
It's a really nice technique for presenting analysis in a paper or journal article, where you want to give a reader a real good intuition for how the data move together, but you don't want to throw in a scatter plot that's kind of BS, because you know spurious correlation might be moving things around in a misleading way. So this is a really cool use of the theorem to create a product, namely a scatter plot, that's of immediately recognizable value to a lay analyst. But that's not all; there's more. Another thing the FWL theorem demonstrates for us is that standardizing variables, by centering them around their mean, for example, or dividing by their standard deviation, will not change our estimates of beta. We can, in other words, rescale our variables and not worry about that screwing up our regressions in any way. The key idea here is that demeaning or centering a variable is equivalent to running a regression of that variable against a constant, which is something we talked about in an earlier class. So, for example, if I want the mean of some variable y, I can get it out of a regression of y against a constant: it's the beta that comes out of y regressed against 1. If I call i a vector of 1's, then (i' i)^(-1) i' y = mu_y. So if I demean a variable, if I construct a new variable y minus mu_y, what I am in effect doing is running a regression of y against a constant and extracting the residuals from that regression: Mi y. And what the theorem tells me is that if I run Mi y = Mi x beta hat, I will get the same slope estimates I would get if I ran y = x beta hat without mean-centering the variables. So, in other words, if I want to standardize variables in some way, subtracting their mean, centering them on zero, dividing by their standard deviation, or whatever, I don't have to worry too much about really screwing up the regressions, because the Frisch-Waugh-Lovell theorem says those transformations are neutral with respect to beta. So that's kind of cool. (There's a quick R sketch of this check at the very end, below.) All right, so that's the conclusion of the lecture. Hope you enjoyed it. What we'll do in class is go over some problems and questions related to this lecture as a group. I'll answer any questions you have about the lecture material first, half an hour or whatever it takes to do that, maybe an hour, and then we'll spend the remainder of the time drilling in on some problems, as a group and individually, related to this material. Hopefully that'll provide an opportunity for you to check yourself and make sure you've achieved a good understanding of the material, and also enable you to broaden your understanding of it a little bit. Thanks so much, and I will see you in class.
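As promised above, here is a minimal sketch of the centering check. The data are toy numbers of my own, and the only point is that the slope estimates don't move:

```r
## Demeaning the variables leaves the slope estimates unchanged (FWL).
set.seed(509)
x1 <- rnorm(100)
x2 <- rnorm(100)
y  <- 2 + 1.5 * x1 - 0.7 * x2 + rnorm(100)

coef(lm(y ~ x1 + x2))                      # raw variables
coef(lm(I(y - mean(y)) ~ I(x1 - mean(x1)) + I(x2 - mean(x2))))
## The slopes on x1 and x2 are identical; only the intercept changes (to ~0).
```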