Hello, it's Monica Wahee, your library college lecturer, here to ruin your day with chapter 4.2, Linear Regression and the Coefficient of Determination. At the end of this probably painstaking lecture, the student should be able to at least explain what the least squares line is, identify and describe the components of the least squares line equation, explain how to calculate the residuals, and calculate and interpret the coefficient of determination, or CD for short.
All right, so it's really cool if you have a crystal ball, because then you can make predictions, right? You just look at the crystal ball. It's some nice equipment. I've had friends who have them, and they're very nice to put out on your dining room table as a centerpiece. Unfortunately, though, they don't really play much into statistical predictions. So what I'm going to show you in this lecture is how we use statistics for prediction instead of this beautiful crystal ball.
We're going to start by talking about what the least squares line is. Then we're going to talk about the least squares line equation, which is the crystal ball thing we use, only in statistics, okay? Then we're going to talk about dealing with prediction using the least squares line. And finally, we're going to talk about the coefficient of determination.
So let's get started, and let's get started with the term least squares criterion. Remember, criteria is plural and criterion is singular. Criteria is stuff you need to meet to be eligible, like you have to meet the criteria for registration for college, right? Well, the least squares criterion is just one, which is awesome, because then you only have to meet one thing. One of the things you probably wondered when you were watching the last lecture is how you know exactly where to draw this line when you have a scatter plot. Like, how do you know where to make the line the most fair?
So in the last chapter, when we plotted the scattergrams, I just drew a line there for demonstration, but there actually is an official rule as to where the line goes. Basically, the rule is that it has to meet the least squares criterion. Only one line meets that criterion, and that is where the line goes.
So how do we get to that? Well, this is roughly what it looks like. When you draw the line, there is a vertical distance from each of the dots to the line. Now, as you can see by the slide, sometimes the dots are below the line, and sometimes they're above the line. The word squares indicates that whether the distance is up or down, you're going to square it. So it's not going to be negative anymore, because whenever you square a negative, it becomes positive. So first, you're going to have to square all of these distances, okay?
Imagine you were just going to try it out. You'd draw this line, calculate the squares, and add them up, and you'd be like, okay, that's how many. Then maybe you'd tilt the line a little and calculate the squares again. Your goal would be that when you added up all the squares, you'd have the smallest total. So the line belongs where it will cause the smallest sum of squares for the whole dataset. If you were software, which you're not, you're a person, right, you'd be figuring out with your software brain exactly how to tilt this line and exactly where to put it to minimize these squares. But we're people, so I'm going to go on and explain how people do this.
So the trick is, if you can figure out where the line goes, you can draw it on the scatter plot and be right. But there is the challenge of knowing exactly where it belongs on the graph. And you're probably also realizing you don't always have a graph to draw it on. Like maybe you need to talk to somebody about where the line goes and you can't draw a picture.
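The "tilt the line and add up the squares again" idea can be tried out in a few lines of code. This is only a sketch: the four x, y pairs and the three candidate lines are made up for illustration (they are not the lecture's patient data), and the name `sum_of_squares` is a hypothetical helper.

```python
# Made-up x, y pairs, for illustration only (not the lecture's data).
xs = [1, 2, 3, 4]
ys = [2, 3, 5, 6]

def sum_of_squares(b, a):
    """Add up the squared vertical distances from each dot to the line y = bx + a."""
    return sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))

# Three candidate lines as (slope b, intercept a). The first is the one the
# least squares formulas give for this made-up data; the others are tilted.
candidates = [(1.4, 0.5), (1.0, 1.5), (1.5, 0.25)]

for b, a in candidates:
    print(f"b = {b}, a = {a}: sum of squares = {sum_of_squares(b, a):.2f}")

# The line that meets the criterion is the one with the smallest total.
best = min(candidates, key=lambda line: sum_of_squares(*line))
print("smallest sum of squares:", best)
```

Among these three candidates the first line wins. Trying every possible tilt by hand is not practical, which is why the lecture moves on to formulas.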
So how you explain where the line goes is you use an equation. Some of you may remember this and some of you may not, so I thought I'd do a quick review of how lines and equations relate. We're going to get into the least squares line equation, but first I'm going to give you a little flashback to algebra. I'm sorry if this is painful. This is hard for me because I wasn't really that good at algebra. This isn't statistics, this is algebra, but I just wanted you to remember this part.
So back in algebra, there was a chapter where you were given these x, y pairs, and it was different from statistics because they all lined up on a line. See, these pink dots are just perfectly on a line, and these are the x, y pairs. Remember, you had to graph them, kind of like we had to do scatter plots. And then you were given this equation, y equals bx plus a, right? That was the linear equation to describe this line. And you were like, okay, I don't get how to put this equation together with this line.
The teacher would say, well, b stands for the slope of the line, right? Because you have to know the slope. I mean, the line can be tilted any which way, so if you know the slope, you already know something about the line. In algebra, how you would get the slope is you'd calculate the rise over the run. So b in algebra was rise over run, and you'd get the slope.
And then you'd be like, great. But you always needed another thing in order to define the line. Because if you imagine this line is in an elevator, it could still have the same slope but go up or down, right? So we need to anchor it on the y axis somewhere. So a stands for the y-intercept, where the line spears through the y axis. As you can see by the drawing, it looks like the line crosses at zero comma zero, so a would be zero, right? But you don't have to look at it.
What you can do in algebra to get a is, since you've filled in b, you just go grab an x, y pair, plug in the x, plug in the y, plug in the b you just got, and back-calculate the y-intercept, right? That's how you would get the whole linear equation in algebra. I just wanted to remind you of that because we do some similar things in statistics. It's a little different, but I wanted to remind you how to connect what a line looks like with how this equation works.
All right. Well, welcome to statistics. Look, those pink things are not on a line. So we want to make a line. But now you know about the least squares criterion. What you're trying to do is make the line that minimizes the sum of squares, right? So here we go. Remember how I was just talking about this linear equation back in algebra? Well, notice the difference. The main difference here is the hat. The y is wearing a hat. And that's universal in statistics: whenever you see a letter or a number wearing a hat, it means it's an estimate. So of course we're estimating y, because if you look at that line, none of these dots actually falls on it. We don't really expect the real dots to fall on that line, just close to it, right, because of the least squares.
So in a way we almost have the same goal we did back in algebra. We have to get that b, that slope, and then we have to use it to back-calculate our a. So let's go on with that. Like I said, in the software approach, you just feed all the x, y pairs in, and the software just prints out the b and the a. It just prints out the slope and the y-intercept, which is why I love the software. But we don't get to use that in our class. In our class, we have to do the manual approach, just because it's painful. I had to do it too, so now I'm making you do it. Mean, right?
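The algebra flashback (rise over run for b, then back-calculating a from one x, y pair) can be checked with a tiny sketch. The two points here are made up for illustration and are assumed to sit exactly on a line, which is the algebra situation, not the statistics one.

```python
# Two made-up points that sit exactly on a line (the algebra situation).
x1, y1 = 1, 3
x2, y2 = 3, 7

# Slope b is rise over run.
b = (y2 - y1) / (x2 - x1)

# Back-calculate the y-intercept a by plugging one pair into y = bx + a.
a = y1 - b * x1

print(f"y = {b}x + {a}")

# Because both points are on the line, the other pair fits the equation too.
print(b * x2 + a == y2)
```

In statistics the dots do not sit on the line, so no single x, y pair from the data can be trusted for the back-calculation step.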
Okay, what we'll do is plug all the x, y pairs into an equation to get this slope, this b. I promise you I won't give you a ton of x, y pairs, or you'll be there forever. But this next step is one we didn't have to do in algebra. We're going to have to go back to all of our x's and calculate x bar, and go back to all of our y's and calculate y bar. Remember, that's the mean of the x's and the mean of the y's.
You're probably wondering, well, why do we have to do that? I'll show you again, but in case you didn't notice, those dots really didn't fall on the least squares line. They fell around it. And you need a dot that's actually on that line to help back-calculate the y-intercept. One of the rules of the least squares line is that x bar comma y bar is on that line. So if you calculate that point out, you can know it's actually on the least squares line. And so finally, after you do x bar and y bar, you plug in b, you plug in x bar for the x, and you plug in y bar for the y hat to back-calculate the a. So it's a similar but different process from algebra.
So the moral of the story is you need to recycle, right? We've got to be good to the environment. So what has happened? Well, you wouldn't be at this point in your life, making a least squares line, if you hadn't already started out by making a scatter plot, then deciding you wanted to do r, and then making r. When you make r, you end up with that big table, remember, and you end up with all these calculations like sum of x, sum of y, sum of x squared, and sum of xy. Now you want to recycle those. You want to save those calculations from r, because they also fit into the equation for b. You also want to save the r you made, because you're going to recycle that into the coefficient of determination, which I'll explain later.
And then, this is not about recycling, you'll actually have to make this anew: you need to calculate x bar and y bar. You never needed to do that before, but now you need them. So get together your old r calculations, put your x bar and y bar together with them, and you'll be ready to do the least squares line equation.
All right, so here's a flashback. Remember this big table? Remember our story: we had seven patients, and x was their diastolic blood pressure at the last visit they had of the year. And then y was the number of appointments they had over the year. We thought, well, if your diastolic blood pressure goes up, then maybe you need more appointments, because it's a marker of being sick. I don't know, that was my little story.
Over on the right now, you'll see the formula we're using for b. The text gives you two formulas again, and I've always got my favorite. It's the one with the table, right? So here's the formula for b. And you'll notice that b is in the formula for a, so you've got to do b first.
A lot of times students are a little confused about what the goal is here. If you look at the bottom of the slide, the goal is to come up with what b is and what a is, and then fill them in, and that's your least squares line equation. Your least squares line equation is always going to have a y hat in it. That's a variable that just gets to stay there. It's always going to have that equals. After that, whatever your b is gets mushed up next to the x, so it's always going to have that x there. And then plus, and then whatever you get for a. Just as a trick, if a turns out to be negative, it ends up being minus, right? But that's the generic equation. Our goal is to calculate b and a and fill them in, and then we will say, this is our least squares line equation.
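Written out, the formulas for b and a and the generic equation would look something like the following. This is a reconstruction from how the lecture reads them aloud (n, the sums recycled from the r table, x bar, and y bar), not a copy of the slide itself.

```latex
b = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}
         {n\sum x^{2} - \left(\sum x\right)^{2}}
\qquad
a = \bar{y} - b\,\bar{x}
\qquad
\hat{y} = bx + a
```

Note the two different squares in the denominator of b: the first is the sum of the squared x's, and the second is the sum of x, squared as a whole.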
Oh, remember how I was saying you actually need to make some new calculations? You need to make y bar and you need to make x bar. It's a little easier to show while I've got the columns up. If you look at the bottom of the slide, remember how sum of x was 678, and remember how our n is seven, and remember how sum of x divided by n is your x bar. The same goes for y: we have the sum of y divided by seven. I just wanted to quickly remind you that you need to generate these things before you can completely finish the least squares line equation.
Here I cut to the chase, basically. I summarized the actual numbers you're going to need and put them over here, so we don't have to look at that whole big table anymore. You'll notice that I grayed out the sum of y squared, because I realized later we don't really use that here.
So let's look on the left side, under the big list of numbers. You'll see the b equation that I filled in, and if you compare that to the formula on the right side, you'll see what's going on. N is seven, so wherever you see that seven, that's where n is. In the top of the equation, remember sum of xy? Let's just look that up. Yeah, that's that big number, 18,458. I want to be clear: you have to do out that left side, the seven times the 18,458. Then do out the right side, which is sum of x times sum of y, which is 678 times 166. After that, you subtract the right one from the left one, because of order of operations. So that's how you make the numerator.
Now let's look downstairs. Again we have an n, so we know that's seven. And then that's sum of x squared. Remember, this one doesn't have the parentheses around the sum of x.
If it had the parentheses around it, you'd be taking 678 and squaring that. But it doesn't have the parentheses, so you have to use that big number, 67,892, okay? And again, like with the upstairs, you've got to do out that term. You've got to multiply it out before even looking at the rest of the equation. And then, oh, here we go. On the right side of the denominator, we have the sum of x, in parentheses, squared. That's exactly the example I was giving earlier. So you say 678 times 678, and you have to do that one out. After you do that one out, and you do the first one out, you subtract the second one from the first one. Remember, order of operations. If you do it right, you should get, on the left side of the slide, 16,658 for the numerator and 15,560 for the denominator. Then you divide them out and you get about 1.07, which rounds to 1.1, and that's your b. So there you go. That's how you do it.
So now we've got to worry about a. What I did was I just wrote b at the top there, so b is 1.1, and now we can use b to figure out a. Look at my list. Remember how I did x bar and y bar for you, just so we had them ready? Now we're going to calculate a by putting in y bar minus, and remember order of operations again, b times x bar. So we do that multiplication out first and then subtract it from 23.7. And remember how I was saying sometimes you get a negative a? Well, we got negative 80 for a. (That's carrying the unrounded b, about 1.07, through the calculation. With a flat 1.1 you'd land closer to negative 83.) So we've got our b, we've got our a, and let's go.
Now, if you want to check your work, this should work out. You should be able to take b times x bar, minus 80 for the a, and get y bar back, which is 23.7. Use the unrounded b for this, about 1.07 times 96.9, because with the rounded 1.1 you'll come out a few appointments high. If that works out, then you know you did everything right. But remember what the goal was? The goal was to actually fill in that least squares line equation.
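The whole b and a calculation can be replayed from the summary numbers read out in the lecture. This is a sketch of the arithmetic only. The variable names are my own, and the unrounded b is carried through, which is an assumption about how the slide got negative 80.

```python
# Summary numbers read out in the lecture (n = 7 patients).
n = 7
sum_x = 678      # sum of x (diastolic blood pressure)
sum_y = 166      # sum of y (appointments over the year)
sum_xy = 18458   # sum of x times y
sum_x2 = 67892   # sum of the squared x's (no parentheses)

# Do out each term first, then subtract (order of operations).
numerator = n * sum_xy - sum_x * sum_y     # 129206 - 112548
denominator = n * sum_x2 - sum_x ** 2      # 475244 - 459684
b = numerator / denominator

# New calculations: the means of x and y.
x_bar = sum_x / n
y_bar = sum_y / n

# Back-calculate a using the point (x bar, y bar), which is on the line.
a = y_bar - b * x_bar

print(f"b = {b:.1f}, a = {a:.0f}")
print(f"y-hat = {b:.1f}x {'-' if a < 0 else '+'} {abs(a):.0f}")

# Check: with unrounded b and a, the line passes through (x bar, y bar).
print(abs((b * x_bar + a) - y_bar) < 1e-9)
```

Rounded for the slide, this gives b = 1.1 and a = -80, so the filled-in equation reads y-hat = 1.1x - 80.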
So if you look over on the right, that's what we did. We still have our y hat, we still have our equals. Now we have a 1.1 where the b belongs. We still have that x, because y hat and x are variables that just stay there. And then we do minus 80, because we came out with a negative one. If it had been just plain 80, we would say plus 80, okay?
All right, at the beginning of this presentation, I teased you that we were going to do prediction with the least squares line equation. We weren't going to use a crystal ball, we were going to use this equation. Well, I'm finally getting to that exciting part of the presentation. But, and there's always a big but, I first have to warm you up with some rules.
First of all, I just want you to reflect on what we just did and realize that we can draw out a least squares line, but unlike in algebra, our x, y pairs probably aren't on it. In this example, none of the x, y pairs are on it. So you need to be sure about at least one x, y pair that's actually going to land on the least squares line, and the only one you can be sure of is x bar comma y bar. If you reflect on it, that's why we had to calculate it: we had to use x bar and y bar to back-calculate a, the y-intercept. Now, you may be lucky and get a data set where there is an x, y pair that just happens to fall on the least squares line, or maybe even a couple, or maybe more, but you can't trust that. If you need to trust that there's a point on the least squares line, it's always going to be x bar comma y bar.
All right. And now I want to focus more on the slope, or b. Remember, in our example we just calculated b, and we got 1.1 for b, and that's a slope.
So I want to point out that the slope b of the least squares line tells us how many units the response variable, or y, is expected to change for each one-unit change in the explanatory variable, or x. It's kind of a tongue twister, but if you think of our example, it's a little easier to understand. The slope was 1.1 in our example, with x being DBP and y being number of appointments over the last year. What we're essentially saying is that for each increase of one mmHg of DBP, the x, there is a 1.1 increase in the number of appointments the patient had over the past year. So as DBP goes up by one, the appointments go up by 1.1. Well, I don't know what one tenth of an appointment is, but you get what I'm saying, because it's just a y hat.
The number of units of change in y for each unit of change in x is called the marginal change in y. If you think about it, that's 1.1. So 1.1 is the slope, but 1.1 is also the marginal change in y for each unit change in x.
Now, I also want to recall for you this concept of influential points. Remember, we should have done a scatter plot and everything before we got to this point, because we need all those sums of x's and sums of y's and sums of whatever, right? Like with r, if a point is an outlier, and you can see it on the scatter plot, it can really drastically influence the least squares line equation, just like it can screw up r. An extremely high x or an extremely low x can do this. I'm just pointing out a culprit we have here on the scatter plot. So always check your scattergram first for outliers, because you could end up in a situation where you're making a least squares line and there's a bunch of outliers whacking it out.
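To see how much one extreme x can whack out the line, here is a small sketch. The data are made up for illustration (five dots exactly on a line, plus one invented outlier), and `least_squares` is a hypothetical helper that applies the same b and a formulas used in the worked example.

```python
def least_squares(xs, ys):
    """Return (b, a) for the least squares line y-hat = bx + a."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * (sum_x / n)
    return b, a

# Made-up data: five dots sitting exactly on the line y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
b_clean, a_clean = least_squares(xs, ys)

# Add one invented outlier with an extremely high x and a low y.
b_out, a_out = least_squares(xs + [20], ys + [0])

print(f"without outlier: b = {b_clean:.2f}, a = {a_clean:.2f}")
print(f"with outlier:    b = {b_out:.2f}, a = {a_out:.2f}")
```

In this made-up case a single point is enough to turn a positive slope into a negative one, which is why the scatter plot check comes first.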
Okay, now you're probably like, when do we get to the prediction part? And I'm like, you just have to relax, I have to get through a few of these issues first. One of them is the residual. The word residual kind of sounds like residue, right? Like somebody comes over and sets their cup on your coffee table without using a coaster, and it leaves some residue, and you get all mad. Well, that's kind of what a residual is. It's kind of like residue, something left over.
So once you make the least squares line equation, there's something I want you to notice. Remember how we had seven patients, and they each had an x? You can theoretically take each x, plug it into the equation, and get the y hat out. I want to demonstrate doing that. We have our equation up right here. I took patient one's x, which was 70, and plugged it in: 70 times 1.1 minus 80, and I got negative three. Now that's y hat. The real y, which I put on the screen here, is actually three. So as you can see, it's not the same answer. Then I did it with patient two. I did 1.1 times 115, because that's x, and then minus 80, because that's the rest of the equation, and I got 46.5. That was a little closer, because look at patient two's y. That was 45, which is really close to 46.5. That was a little bit better.
The reason I was doing all that is I wanted to tell you the residual is y minus y hat. In the first case, y hat was negative three and y was three. So for patient one, we did three minus negative three, and we got six. So that's the residual. It's kind of like residue, right? It's like the residue left over between y hat and y.
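The residual arithmetic for patients one and two can be replayed in a short sketch. It uses the rounded equation from the slide (1.1x minus 80) and only the two patients whose x and y are read out in the lecture. The function name `y_hat` is my own.

```python
def y_hat(x):
    """Rounded least squares line from the lecture: y-hat = 1.1x - 80."""
    return 1.1 * x - 80

# (x, y) for the two patients read out in the lecture.
patients = {"patient 1": (70, 3), "patient 2": (115, 45)}

residuals = {}
for name, (x, y) in patients.items():
    residuals[name] = y - y_hat(x)   # residual = y minus y-hat
    print(f"{name}: y-hat = {y_hat(x):.1f}, y = {y}, "
          f"residual = {residuals[name]:.1f}")
```

A positive residual means the real dot sits above the line, and a negative one means it sits below.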
And then for patient two, we did it again. We took y, which is 45, minus y hat, which was bigger, 46.5, so we got negative 1.5. That's the residual. So this is how you calculate the residual, and this is what it is. But the bottom line is, you don't want big residuals, because that would mean the line didn't fit very well. You'll find that if you have a really good fitting line, you have very small residuals. And you're probably like, well, what's a good fitting line? We'll get to the coefficient of determination, and that'll help you see what constitutes a good fitting line. But first, I will get to the prediction part.
Okay, so you're done with your least squares line equation and you want to use it for prediction. Let's say you knew someone's DBP and you wanted to predict how many appointments she or he would have in the next year. Now, what you're not doing is reusing the x's from your data. We just did that to make the residuals. What you're doing is imagining a new patient out there, and you're going to use this equation for prediction. So you could plug in the DBP as an x, get the y hat out, and say that's your prediction.
But you've got to use some caution. As you can see, I put the x's up here, and the range of x in the original data was 70 to 125. Those were the values covered by x. If you pick an x somewhere in there, this type of prediction is called interpolation, and people feel pretty good about it. But if you use an x from outside the range, like one that's smaller, like 65, or one that's bigger, like 130, then it's called extrapolation. That's not such a good idea, because you don't know if the line is really going to work out there. So here, I'm going to give you an example of interpolation. The patient in your study has a DBP of 80.
Okay, so 80 is right in there, it's in that range. So let's use it. This looks familiar because we just did it when we did residuals, but we're using a new person now. So 1.1 times 80 minus 80 equals eight. What we would do is predict that this patient would come to eight appointments next year. So there, that's how we use our least squares line equation like a crystal ball where we can predict.
So is it really this easy? Is this all you have to do to predict the future? Well, it's not really that easy. You can't make a useful linear equation out of any old x, y pairs. Remember this from our last lecture? See the scatter plot. It looks like, what, a cloud of gnats, right? It doesn't look like you should make a line. But you know what? If you feed that stuff into the software, or into your b formula and your a formula, you'll get a line out of it, even if there's no linear correlation. And if you get that line out of a scatter plot that looks like this, then it's not a very good line, and it wouldn't work very well for prediction, because the data look pretty unpredictable. For that reason, we can't just accept any line that is handed to us. To evaluate whether our least squares line equation should be used for prediction, we need the coefficient of determination.
So here we are at the coefficient of determination. Remember how I said you have to recycle, recycle, recycle? Well, get out your r. Time to recycle r. The coefficient of determination is also called r squared, and it literally means r times r. And I just have to add this on: just like the coefficient of variation, remember that one, we always turn r squared into a percent. So you multiply it by 100%.
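Before the r squared arithmetic, the prediction step from a moment ago can be written as a small sketch that also labels each prediction as interpolation or extrapolation. It assumes the rounded equation (1.1x minus 80) and the x range of 70 to 125 given in the lecture. The function and constant names are hypothetical.

```python
X_MIN, X_MAX = 70, 125   # range of x (DBP) in the lecture's original data

def predict_appointments(dbp):
    """Return (predicted appointments, kind of prediction) for a new DBP."""
    kind = "interpolation" if X_MIN <= dbp <= X_MAX else "extrapolation"
    return 1.1 * dbp - 80, kind

# 80 is the lecture's example; 65 and 130 are its out-of-range examples.
for dbp in (80, 65, 130):
    prediction, kind = predict_appointments(dbp)
    print(f"DBP {dbp}: y-hat = {prediction:.1f} ({kind})")
```

A DBP of 65 comes out as a negative number of appointments, which is one concrete way to see why extrapolation deserves caution.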
So in this example, remember, in the last lecture we did the r, not for the scatter plot I just showed you, but for the one of DBP and appointments. We got an r that was a really strong positive correlation: 0.95. Well, if we want to calculate r squared, which is the coefficient of determination, we take 0.95 times 0.95, and we get about 0.90. But we've got to do that percent thing, so we end up with 90%.
This is how you say it: 90% of the variation in y is explained by the linear equation. Y varies, right? How many appointments they had was different for each person. Well, 90% of that variation is explained by the equation. And of course, if you take 100% minus 90%, there's 10% unexplained variation. So there's still some variation that could be explained by other variables, but not a lot. How you actually state it, if you were writing a paper, is that 90% of the variation in the number of appointments is explained by DBP. I know people are like, explained? It doesn't have a mouth, what is it talking about? You just have to say it this way. It's statistics-ese. And as the complement, what you would say is that 10% of the variation in the number of appointments is not explained by DBP. It could be explained by other things.
Well, we happened to get a nice high CD, for coefficient of determination. But what if it's low? Let's just think about it. As a rough rule of thumb, CD should be better than at least 50%, and the higher the better. On a test, nobody's going to give you a CD of like 60% and ask whether it's any good, because, I don't know, you'd be very conflicted. In real life, what I use it for is to compare models, if one is 60% and the other is 55%.
Of course I'm going to go with the 60% one, but it's still not very good, right? The higher the better, basically. And if it's low, it means that you probably need other variables to help the x you used explain more of the variation, because that x is not doing it.
Okay, in summary, I just wanted to go over chapter four so you realize where we've been. We started out with a set of quantitative x, y pairs. The first thing we did was make a scatter plot. We wanted to look at the linear relationship between x and y, and we wanted to look for outliers. If we'd seen a lot of outliers or no linear relationship, we would have stopped there. But because this is a class and we had to learn, I forced there to be a scatter plot with a linear relationship and not too many outliers, so we could move forward and do r. We calculated r to see if our correlation was positive or negative, and weak, moderate, or strong. That's what you do if you find a linear relationship.
Next, in this lecture, we calculated b and a to come up with the least squares line equation. I just wanted you to notice that the sign on b will always match the sign on r. If you have a positive r, you'll have a positive slope. If you have a negative r, you'll have a negative slope. Otherwise the numbers won't match, just the sign. I also wanted you to notice that strong correlations will give you high coefficients of determination, even if they're negative correlations, because remember, it's r times r, and negative r times negative r is still positive. So if you have a strong correlation, like negative 0.9 or 0.9, it really doesn't matter what direction. If it's strong, you're going to get a high coefficient of determination.
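The "r times r, then make it a percent" recipe, and the point that the direction of r does not matter, can be checked with a very small sketch. The r of 0.95 is the one from the lecture's example. The plus and minus 0.9 values are the lecture's illustration. The function name is my own.

```python
def coefficient_of_determination(r):
    """CD as a percent: r times r, multiplied by 100."""
    return r * r * 100

for r in (0.95, 0.9, -0.9):
    cd = coefficient_of_determination(r)
    print(f"r = {r:+.2f}: CD = {cd:.0f}% explained, {100 - cd:.0f}% unexplained")
```

The negative and positive 0.9 give the same CD, because squaring removes the sign.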
So after we did this b and a thing, we used that linear equation to calculate residuals. We took the x's from the original data, put them in, got the y hat, and calculated the residuals. After that, we used r to calculate the coefficient of determination, or CD, to decide if we wanted to use the linear equation for prediction, because if it was bad, we weren't going to do that. But we decided it was good for prediction at 90%, and we decided to use it. So that was our journey through these x, y pairs all the way down to the coefficient of determination. Good job, you made it.
So in conclusion, the least squares criterion and calculating the least squares line was the first thing we went over: how to do that and what it all means. Then I reviewed some issues with prediction using the least squares line, because it looks kind of easy and kind of, you know, better than sliced bread, but there are some things you have to think about. Finally, we went over the coefficient of determination, so that you could figure out how good your least squares line equation was. And I just wanted to point out that CD kind of looks like CDs, you know, like we used to have CDs. They were so pretty and rainbowy like that. But now all CD means is coefficient of determination.