Welcome to our lecture on simple linear regression. In regression, we're going to have something called a dependent variable. That's the Y variable. And that's expressed in terms of its relationship with the independent variable, X. It's very important to remember: X is called the independent variable, and that's the one that's supposed to affect Y, the dependent variable. The way to remember that is Y depends on X. The Y is dependent. Now, in simple regression, you only have one X variable, one independent variable. That's what we're going to learn in this course. In more advanced courses, you're going to learn what's called multiple regression, where you have several independent variables. So you'll have an X1, an X2, an X3, etc. We're still trying to do the same thing. We're using the X variables to predict the Y variable. It doesn't matter whether it's one or many. What happens, though, is you're not going to be able to do this by hand if you have several X variables. It gets very complicated, it uses matrices, but the computer does it for you. Again, we're going to be studying simple linear regression. We're going to see whether X and Y have a linear relationship, and what that linear relationship is. I know we've talked about this in earlier lectures, but you could hardly blame me for shamelessly promoting the notion that if you study more, your grade on your quizzes might be better. Here's an example just to show you what regression is all about. A researcher wishes to determine the relationship, if there is one, between hours studied and the grade you get on a quiz. In regression now, there's something new we didn't have in correlation. We're going to be coming up with a mathematical equation where the variable you're studying (in this case, we want to know what goes into grades) is going to be the outcome variable, the Y, and the variable that goes into it, that's independent and that affects the Y variable, is called X.
And we have to know which is X and which is Y. In correlation, we were just looking at association. So over here, we clearly label: hours studied is X, grade on quiz is Y. This is a small example. You'd never do regression with only five pairs of data, but this is just an illustration. We've got five students. The first student studied one hour, the next ones two hours, three hours, four hours, five hours, and the grades on the quiz that matched along with that were 40, 50, 60, 70, and 80. Now let's see what we do using regression to analyze this data. Every data point is an XY pair, and in this unusual example, as you can see when you plot your points, every single one of the points falls on a straight line. This is easy. It doesn't happen in real life, but it's nice to look at and understand what's happening. And if you remember from the correlation lecture, since that's clearly a positive line, the slope is positive, this means the correlation coefficient R is equal to plus one, and the coefficient of determination, R squared, is one, or 100%. What that means for the coefficient of determination is that the proportion of the variation in Y that's explained by X is everything. There's no other variation. There's nothing else going on. Hours studied is the only thing you have to look at if you want to predict someone's score on an exam. And if the line looks like this, you don't even have to ask questions. Obviously, it's easy to extend the line to X equals 6, and it's easy to predict what Y will be. This doesn't happen in the real world. It's a nice example to learn from. Later on in the lecture, we'll see things that are a little bit more realistic. But let's move on and see what else regression can do for us. Everything you see on this slide about the equation for a straight line, you would have learned in high school or middle school.
If you need a refresher, go to our website. In Bootcamp, there is a section on plotting a straight line. So in this case, since this is so easy and all the points are on the line, it's easy to figure out that the line itself is 30 plus 10 times X. That's how you predict any value of Y. We'll get to what Y hat is, but basically Y hat just means this is the regression line. This is the point on the regression line. If you see 30 plus 10 X and you remember how to plot a straight line and what the equation looks like, you know that 30 is the Y intercept. You know that when X is zero, for somebody who doesn't study at all and studied zero hours for the quiz, their grade would be 30. And then after that, every additional hour studied adds 10; the slope is 10, so it adds 10 to the final value of Y, the grade. In regression, we call these two coefficients B0 and B1. So B0 is what you might have called A before, and B1 you might have called B. If you're not doing regression but you're specifically looking at the equation for a straight line, it might be A plus BX, or MX plus B; it's only notation. In regression, we have B0 as the constant term, and that's the Y intercept, because if X is zero, Y hat is just equal to B0. B1 is the slope term, and it tells you incrementally, for every additional unit of X, how much it contributes to Y, to Y hat. So you can see right away, if somebody studies zero hours, they should get a grade of 30. And what about the question we had before? If someone studies six hours, what should they get? Well, 30 plus 10 times 6 is 90, and that would be the expected grade on the quiz even though it's not within our data set. Now we can see what the regression equation looks like. Y has that hat. Remember, don't confuse Y hat, which is the points on the line, with the input data, because you have to input X and Y. We call that Yi, without the hat on it. So Y hat i is B0 plus B1 Xi, and as you know, B0 is the intercept and B1 is the slope term.
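As a minimal sketch of the prediction step just described, the line from this example can be written as a small function. The name `predict_grade` and its default arguments are illustrative choices, not something from the lecture; the coefficients 30 and 10 and the five data pairs are the ones given above.

```python
def predict_grade(hours, b0=30, b1=10):
    """Return the point on the regression line: y-hat = b0 + b1 * x."""
    return b0 + b1 * hours

# The five (hours studied, quiz grade) pairs from the example.
data = [(1, 40), (2, 50), (3, 60), (4, 70), (5, 80)]

# In this unusual example every observed grade sits exactly on the line,
# so each residual (y minus y-hat) is zero.
residuals = [grade - predict_grade(hours) for hours, grade in data]
print(residuals)          # [0, 0, 0, 0, 0]

print(predict_grade(0))   # the intercept b0: 30
print(predict_grade(6))   # extending the line to six hours: 90
```

With real data the residuals would not all be zero, which is where the least squares idea comes in.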
Why do we need regression in addition to correlation? Correlation just gives you an R. Remember, R goes from plus one to minus one; you get a correlation coefficient and you can test for a relationship. But if you want to actually be able to predict Y for a different value of X, you need to do regression. Or, very important, you might want to know what the slope is. You want to know the change in Y over the change in X. For example, in the real world, you might be asked a question like: if I raise price by a certain amount, what effect will that have on sales? Maybe you learned something like that in elasticity. That's very important. You need a slope for that. Or if you're in marketing and you say, well, if I add shelf space, that's the X variable, will it have an effect on sales? That's a very important problem in marketing. For the effect of shelf space, well, you need a slope for that. So slopes are important. You can't get a slope just by getting R. R is not a slope. It's related to the slope you'll find, though: if B1 is positive, R is positive. They're related to each other, R and B1. And finally, you want to see the scatter plot. You want a line through it. And you can do that very easily when you draw a scatter plot. You can get a line, and that's the regression line. In correlation, all you're going to know is whether two variables are related. That's it. Here, we're going to show you what you're actually getting when you get the B0 and B1. Remember, you've taken a sample. Say you took a large sample, 100 observations. If you look at Earth, it's a lot bigger than your 100 observations. If you look at people, you know, you're talking about the United States. You have 330 million people here. You're going to sample 1,000 people to get your regression equation. So your B0 and B1 terms are just sample estimators of the true population parameters, beta 0 and beta 1. So keep that in mind. They're estimates.
That's why you have to test them for significance. There's a way to test for significance, which we're probably not going to do in this course. But you should bear in mind, you've got to test the slope for significance, and there are ways to do that. You'll see it on a printout. It's important to keep that in mind. Here, we show you your regression equation, with the sample estimators B0 and B1. Then we show you the real one, which you don't see unless you were to take a census, take every single person on planet Earth, which of course is not feasible. There you have beta 0 plus beta 1 times x plus an e term. That's the random error. And again, the assumption of regression, your null hypothesis, is going to be that there's no relationship: that the slope is 0, that x and y are not related. Basically, in simple English, no regression. X and y are just unrelated, and you can't use x to predict y. Here, again, this is a scatter plot. You see the y and the x, and notice you have the observations and you've got the regression line through them. You can do this with a computer. It's very simple with a scatter plot program, right? And notice that you have these points; that's your data, the original data, the x and the y points. If you look carefully, you'll see that actually not one point is on the line. Some are above, some are below, but this line is the best fit line. We're going to learn in a minute what that means. It's actually something called the least squares line. What does it do? Mathematically, we'll learn a little bit about this soon. It's a special line that does the best job, and you hear the words least squares. To understand what least squares means, you have to understand what the residuals are. Notice that some of the points are above the line. So you have a positive residual. You might think of a residual as a deviation. So it's a positive deviation from the line. Some are below. That's a negative deviation.
And now we're showing you here what you're mathematically trying to do. You're trying to minimize the sum of the squared errors, SSE, and again the E's are the residuals. So you see the sum of the Ei squared, and we show you what it's equal to. It's the sum of the Yi, that's the data that you put in, minus Y hat i, squared. And we show you the formula there: the sum of Yi minus the quantity B0 plus B1 Xi, that whole thing squared. That's the thing we're going to try to minimize. Using partial derivatives, you can actually derive what are called the normal equations, and those tell us what equations have to be solved to get the B0 and B1. So if somebody asks you what the B0 and B1 do: they minimize the SSE, the sum of the squared residuals. Again, the E sounds like error, which it kind of is; it's a random error. So when you minimize those squared residuals, those deviations, the B0 and B1 that do that for you are going to be your regression equation, and the computer generally does it for you. We're going to do it by hand too, but really, in the real world, you use the computer for this. Well, here you see the residuals. You see? Each one is a vertical line. The first one is above, so it's positive. The second one is below, the third one is above, and the fourth one is below. So you have positives and negatives. And here's the definition of a residual. It's Yi minus Y hat i. See, that's the Ei. First, I should tell you about the sum of the Ei. It turns out it will always work out that the sum of the Ei is zero. So that's why we're going to look at the sum of the Ei squared. We're going to square those residuals, those deviations, and minimize that, and that's what we already called SSE, the sum of the squared residuals. That's the thing that we minimize. Okay, here we're going to give you the steps to do correlation and regression together. So you'll get the R and you'll get the regression equation. Generally we'll give you these variables.
You can obviously get them from Excel: the sum of the X, the sum of the Y, the sum of the Xi times Yi, the sum of the Xi squared, and the sum of the Yi squared. And there's the formula for calculating R. That's the formula you're going to use, and R, again, is going to be a number between plus one and minus one, and there's a way to test the correlation for significance. Even if all you're doing is correlation, you must test for significance to make sure that R is significantly different from zero. So that's the first step. That's how you get R. Once you have R, you might want to square it and get the coefficient of determination. Remember, R squared is between zero and one. And that's the proportion of the variation in Y, the dependent variable, that's explained by the independent variable. So think of it as a proportion. If X explains, let's say, 60% of the variation in Y, that means 40% is left unexplained. Now, if you want to calculate the regression coefficient B1, look at the formula. You'll notice the numerator is exactly the same as the one you had for the correlation, and the denominator is kind of half of what you had for the correlation. So you really have all these terms already. Everything is there. So it's very easy to get the B1 term. I told you B1 and R are very much related. They're always the same sign, and there's a relationship between the two. Once you've calculated B1, it's very easy to get B0. Remember, your input data was the Xi and Yi. Well, get the average of those two columns that you put in. So you'll have Y bar and X bar. Then B0 equals Y bar minus B1 X bar, and B0 is again the Y intercept. That's the predicted value of Y when X is 0. After you calculate the B0 and the B1, it's important to write out the equation. Always write it out: Y hat equals B0 plus B1 Xi. Always write out the equation, and it's good to know what the X and the Y represent.
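The steps above can be sketched in a few lines of Python. The slide's formulas are not reproduced in this transcript, so this sketch assumes the standard computational forms, which match the description given (same numerator for R and B1, and B1's denominator being "half" of R's). The function name `regression_from_sums` is a hypothetical helper, and the data are the five hours-studied pairs from the first example.

```python
from math import sqrt

def regression_from_sums(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Return (r, b1, b0) computed from the five summations."""
    sxy = n * sum_xy - sum_x * sum_y   # numerator shared by r and b1
    sxx = n * sum_x2 - sum_x ** 2      # the x "half" of r's denominator
    syy = n * sum_y2 - sum_y ** 2      # the y "half" of r's denominator
    r = sxy / sqrt(sxx * syy)          # correlation coefficient
    b1 = sxy / sxx                     # slope
    b0 = sum_y / n - b1 * (sum_x / n)  # intercept: y-bar minus b1 * x-bar
    return r, b1, b0

# Hours studied (x) and quiz grade (y) from the first example.
x = [1, 2, 3, 4, 5]
y = [40, 50, 60, 70, 80]

r, b1, b0 = regression_from_sums(
    len(x),
    sum(x),
    sum(y),
    sum(xi * yi for xi, yi in zip(x, y)),
    sum(xi ** 2 for xi in x),
    sum(yi ** 2 for yi in y),
)
print(r, b1, b0)   # 1.0 10.0 30.0
print(r ** 2)      # coefficient of determination: 1.0
```

Because every point in that example lies on the line, the sketch returns R of 1 and the same 30 plus 10 X equation found earlier.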
There are actually three ways to test the regression or correlation for significance, and you'll see them in the output of Excel. So notice, (a) you can actually test R for significance. We're not necessarily saying you're going to have to do it, but this is the way you would test it, and there's your H0 and H1. Or you can test the slope term, and your H0 is that the slope, beta 1, equals 0. And the third way is doing the F test, which is part of the Excel printout, and you'll get an F value. We'll be showing you that when we show you the printout. But the important thing to remember is that you do have to test for significance. You want to make sure that there's a relationship between X and Y. Otherwise, don't do regression, and don't do correlation either. We're going to look at another simple example: five pairs of observations, where X, the independent variable, is how much water we use on our crop of tomatoes, and Y is the yield. We're looking to see if there's a functional, linear relationship between those two, more than just looking at correlation. And remember, once again, five pairs of data is nothing. It's really too little to run regression, and most people will not do that in the real world. This is just so that the data set is small enough that we can write everything out and illustrate to you exactly what's going on. You see the data displayed in the table: the X, Y pairs, where Y is listed first and then the X that goes along with it. You know from the summations that you need the sum of the Y, the sum of the X, the sum of the X times Y (that's your third column), the sum of the X squared, and then the sum of the Y squared. So there's a neat, handy little table; if we're doing this by hand, it's an easy way of collecting all the sums. Of course, nobody does it by hand, but it's a nice way of illustrating what we're doing. Remember the steps in regression? Well, here they are, all laid out for this particular problem on a single slide.
Step one was getting the summations you need, and there we have them, same as on the slide before; we pulled the summations out of the table on the previous slide. Step two was to calculate the correlation coefficient using that nice big formula. It's not difficult, it's just big. And we end up with a correlation coefficient R of 0.9903. So there's a positive relationship. When we go to step three and we take R and square it in order to get the coefficient of determination, we end up with one that's very, very high: 98.06% of the variation in the crop yield is explainable by the amount of water used in this particular problem. Step four is to get B1, the slope term, and you see that whenever the correlation coefficient is positive, the slope will be positive, and vice versa. So it's kind of like a check: if those are not the same sign, you did something wrong. And we see there's a positive linear relationship between the water used and the yield of the crop. Step five is the other coefficient, B0, the Y intercept. And then finally, you take B0 and B1 and put them together into the regression equation, and we have an equation to represent the line that we get in our scatter plot, the regression line, and it's negative 1.3 plus 3.1 times X. Now let's look at what these things mean. All right, so we have our regression equation laid out on the top line, and right underneath each term is what it means. Y represents the crop yield, or number of bushels of tomatoes that you get. X is the number of gallons of water, the amount of water used. What's the relationship between X and Y? The constant term B0, negative 1.3, is the Y intercept, and as you know, what that's supposed to tell you is: for an X of 0, what will Y be? So if you don't water at all, what will you expect from a crop?
And of course, a negative Y intercept doesn't do anything but help you plot the line. Especially in this case, there's no way to explain a negative crop yield; we're not going to go back to last year's crop and donate some. So all it is in this case is just a mathematical device to draw the line. It has no meaning. Sometimes it does, like in the previous problem with hours studied, but in this case, unfortunately, not. B1, the slope term, tells you, for every additional gallon of water, how much that contributes to the outcome, to the number of bushels of tomatoes. And this is meaningful: every additional gallon of water means an additional 3.1 bushels of tomatoes for your output. On the left side of the slide, we show you some questions that you can answer now that you have computed the regression equation. So, for example, how many bushels of tomatoes can we expect if we use 3.5 gallons of water? Easy. We just substitute 3.5 in the equation where X goes, right? Negative 1.3 plus 3.1 times 3.5 gives you 9.55 as your predicted regression value, your predicted outcome. So 9.55 bushels of tomatoes is the answer to the question.
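The substitution just described can be checked with a short sketch. The function name `predict_yield` is illustrative only; the coefficients negative 1.3 and 3.1 are the ones from the regression equation above.

```python
def predict_yield(gallons, b0=-1.3, b1=3.1):
    """Predicted bushels of tomatoes: y-hat = b0 + b1 * x."""
    return b0 + b1 * gallons

# Predicted yield at 3.5 gallons of water.
prediction = predict_yield(3.5)
print(round(prediction, 2))  # 9.55

# The slope is the change in predicted yield per extra gallon.
extra = predict_yield(4.5) - predict_yield(3.5)
print(round(extra, 2))       # 3.1
```

The results are rounded only to hide floating-point noise; the arithmetic is the same as the hand calculation.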
Of course, what happens if we say, well, I want to add 10 gallons of water, that must be even better? It's outside of the data that we used to construct this regression line, so we don't really know. And in fact, one thing we do know is that sometimes you can flood a crop, and that's the danger of extrapolating. Of course, I know, yes, we do it all the time, but we try not to, and if we do, it's better to do it close to the data you already have, like maybe 6 gallons as opposed to 10 gallons. But yeah, there is a danger in making a prediction that's outside the range of X that you used to develop the model in the first place. Now, on the right side, you see what we were talking about: computing the residuals. You have the Y and the X again, and now in addition we have Y hat. Remember, Y is the original value: for an X of 1 gallon there was a yield of 2 bushels, for an X of 2 gallons there was a yield of 5 bushels, and so on. But those points weren't necessarily on the line; they may have been close. So we also include the point on the line for each X value. Remember, X is considered to be fixed, the independent variable or the input variable, and Y is measured. With 1 gallon of water we actually got 2 bushels of tomatoes; the regression line predicts 1.8 bushels. So that tells us where the line is placed inside our data. That also gives us a way of figuring out sampling error. The residual, the deviation E between Y and Y hat, is listed in the next column, and after that we square it. This is very much like what happened when we were trying to come up with a standard deviation: we looked at all the deviations and added them up and said, oh wait, this doesn't work, because some are above and some are below, so mathematically the sum has to come out to 0. That's what happens here. Some of the residuals are above the line, some are below the line, and mathematically they're always going to sum to 0, so that's not helping us at all. If we square them, though, we can end up with the sum of squared residuals, or, as it's also called, the
SSE, the sum of squared errors, which is something that's very valuable to us because it tells us something about the inherent variation that's not due to the regression. What we got was a sum of squared errors of 1.9. One thing we know about this, regardless of anything else we might or might not know, is that there's no other line that we could have constructed through the data that would give us a smaller value of SSE, of the sum of the E squared. 1.9 is a minimum, because that's what regression promises us: it's a least squares, or best fit, line through our data. Here you see the Microsoft Excel output for the same problem that we just did painfully by hand. You can find on this output anything that we already found, and you can see a lot of it is labeled: R, the correlation coefficient; R squared, the coefficient of determination. What about the regression equation, Y hat? Sure, all I need is to find the coefficients, and then I can put them into the equation, which you also see written out on the output, and it's exactly the same equation we got before. In addition, I can find information to help me decide if the regression is significant, for example the F test, the F statistic. How do I know if it's significant or not? Right next to it is a column called Significance F, and you can see that over here it's 0.00115. And what is it testing? It's testing the null hypothesis that there's no regression, that X does not affect Y, that there's no linear relationship. H1, the alternative hypothesis, is that yes, the regression is significant. With a large F statistic like that, we can be sure that it will be significant. We'll talk more about that later, but one thing to notice in this case is that Significance F is basically the P value you used before when we were doing inference. If you typically work, let's say, at an alpha level of 0.05, this is much, much smaller than 0.05, and so indeed there is a significant relationship between X and Y in this example. If you look at the output you get from
Excel, you'll see something called an ANOVA table. That's where you get that F value. We'll see it more clearly in future slides; if you go back, you'll see it's small there, and you'll see better slides later. But basically, what you have to realize is that the way we get that F value is we look at the sources of variation in regression. We look at something called the total variation in Y, and there's the formula: the sum of the Y minus Y bar, squared. That's the total variation in the Y, the thing you're trying to predict. In the previous problem, we were trying to predict the tomato yield. We had a problem where we were trying to predict grades. A lot of researchers are trying to predict longevity, how long people live. In any case, it's the total variation in Y, and we can break it up into two components. There's something we call the explained variation, which Excel calls regression. I like to call it explained: explained by X. It's the part of the variation in Y that's explained by the X. X is trying to explain it, like hours studied was trying to explain the grades, or like we're trying to explain crop yield, tomato yield, using the amount of water we're using. So there's an explained component that's called regression in Excel, but explained is another term that's often used for the same thing. And you'll see it's an SS, that's a sum of squares: the sum of the Y hat minus Y bar, squared. You're not going to do this by hand; the computer does it for you. And what's left is called the unexplained variation. That's SSE, and we know what that is. In Excel, it's called the residual. So what is the residual? That's the unexplained variation. That's what X did not explain when it comes to explaining Y. It's the part that's unexplained. And this is what we do in the ANOVA table: look at the sources of variation in regression. Eventually that will lead to that F value. What is F?
It's a distribution. Like Z and T, now you have a new distribution called F. And here we see the same thing again: the total variation in Y. Look at that sum of the Y minus Y bar, squared. If you divide it by its degrees of freedom, N minus 1, you've actually got the variance of Y. Put a square root around it and you've got the standard deviation of Y. So we're familiar with the idea of variation. Okay, so we look at the total variation, which is called total in the Excel printout. The explained is actually called regression. There is the formula for it; we're not going to do this by hand, the computer does it for you. And then what's left is called the residual in Excel, or really it's just the unexplained variation. Looking at the printout, the total variation in Y was 98. The explained, that's explained by X, which is called regression in Excel, turns out to be 96.10. Again, you don't have to do these calculations; the computer does them for you. So the explained was 96.10, and, we saw this number before, the 1.90 is what's left. These are called sums of squares, and the sum of squares total is equal to the sum of squares explained plus the sum of squares unexplained, the SSE. So that's how the computer gives you those sums of squares: 98, 96.10, and 1.90. And then there's R squared, the proportion of the variation in Y explained by X. Now, one way of getting it is to take the R, which we got before, 0.9903, and square it, and you get 0.9806. Well, the computer does it by means of the sources of variation. If you take the explained variation, that's how much X explains of Y, that was 96.10, and the total variation of Y, 98, and divide 96.10 by 98, you also get that 0.98, rounded. In other words, 98% of the variation in crop yield is explained by the amount of water used. So it's very important to understand what R squared is doing. You can get it two different ways: by looking at the R and squaring it, or by just getting it from the printout and looking at the sum of squares
regression divided by sum of squares total, in this case 96.10 over 98, and you get the R squared that way. So you get the coefficient of determination a second way. More important is to understand what that coefficient of determination is. In example 2, we're once again looking at hours studied and grades. It's a different set of data. N is 7, so 7 pairs of data. 7 is better than 5, but it's still not really enough. We would really like more, but we're doing this one by hand, and so I'm being easy on myself. This is a quiz where the highest possible grade is 15. Like before, X is hours studied and the grade on the quiz is Y. We're studying the grades and what goes into grades, and we're looking to see if the number of hours studied determines in some way the value that you get on the quiz. It's very important to know what variable you're calling X and what variable you're calling Y. And again, you can see we could do this by hand if necessary, if for whatever reason our computer and calculator all broke. We can get the 5 necessary summations, which you see laid out over there under each column. Let's move on and do the calculations to get the regression equation. Once again, we've laid out all the calculations that you need, if you're doing this by hand or with your calculator, on one slide, steps one through six. In step one, we want to make sure we have all the sums that we're going to need. We just copied those summations from the previous slide, and now we have the formulas that we're going to use these summations in. For step two, we compute the correlation coefficient R, just plugging the numbers into the formula. R is 0.98, so it's pretty high and it's positive. It's fairly close to one, which means, by the way, that when we get to it, the slope will also be positive. The sign on R and the sign on the slope B1 are the same. If we take R and square it, we end up with R squared, the coefficient of determination, and that's 0.9604, which means to us that more than 96% of the variation in
grades can be explained by hours studied. That seems pretty high. When we get to the regression, we're going to look to see if it's significant, but it seems like it's going to be, because that's a pretty high number, a pretty high proportion. Step four and step five are basically to get the coefficients so that you can write down the regression equation in step six. Step four gives you B1, the slope term. Again, plugging numbers into the formula, you get 1.3214. There's a positive linear relationship between hours studied and the grade on the quiz. Step five is B0, the y-intercept, in other words, what your grade will be if you don't study at all, and that's 4.5715. Do a tiny bit of rounding and write down the regression equation: 4.57 plus 1.32 times X. So for any value of X, we use this equation to predict what Y will be, what the grade will be. A little repetition here: the regression equation is again written at the top of the slide. 4.57 plus 1.32 times X will give you a prediction, at any value of X, of what the grade on the quiz will be. Now, explain the meaning of the regression coefficients. Again, B0, the y-intercept, is 4.57, which means if you study nothing, the grade on the quiz will still be 4.57 or thereabouts, not bad if you're happy with that. And the slope tells you something about change in Y over change in X, or in other words, every additional hour studied contributes another 1.32 to your final grade on the quiz. Next question: what happens if someone studies three and a half hours? What would we predict for the quiz score? That's a perfect use of regression, and we're saying that three and a half is the value of X that we want to look at. If you plug that into the regression equation, 4.57 plus 1.32 times three and a half, you end up with 9.2. That's a quiz grade of 9.2 or thereabouts. Let's look at the Excel output for this problem. You can see indicated where to pick up all the things that we worked so hard to compute on the previous slide: R, 0.98; R squared, the coefficient of determination,
0.96; the intercept, that's B0, is 4.57; and the slope, which Excel somehow labels X Variable 1, but we know it's the slope, is 1.32. So if we want to write out the regression equation, we could do it from those. But before we do that, let's first look to see if this regression is significant. How do we know if it's significant or not? We look at that circled value, Significance F. It's a tiny, tiny, tiny number. That's the probability of getting the sample evidence, or something more extreme, if the null hypothesis is true. It's such a tiny probability that clearly the null hypothesis, that there's no relationship between X and Y, is not true. There is a linear regression and it's significant, so the answer is yes. Now we want to write out the regression equation. We pull that by looking at the coefficients, and it's the same as before: Y hat is equal to 4.57 plus 1.32 times X. What's the proportion of variation in grades explained by hours studied? Okay, that's a lot of words, and you have to figure out if you can understand it. There's some language in statistics; it's not just math. But basically, what you're looking at is the very definition of R squared. R squared, the coefficient of determination, is the proportion of the variation in grades, the Y, that's explained by X, hours studied. So pulling it off the printout, we have 96.14%. And then finally, I can ask any question I want in terms of what I can predict for a grade for a person who studies a certain number of hours. We're going to just try 3.5 once again, and we pull the values for Y hat off of the printout. We've got 4.57 plus 1.32 times 3.5, and once again we get 9.2. The grade on the quiz for someone who studies 3.5 hours should be about 9.2. This is example 3, job performance. This is not unusual, by the way. In industrial psychology, they try to relate performance to some kind of other measure, like a test in particular. So here, a company wants to see if there's a relationship between job performance and some kind of major field test score. They look at people who are
majoring in business, and they have some kind of score for that. There actually is a real one, which we don't mention here, given by the ETS company, but we're making this one up: the XYZ major field score. And they're relating it to job performance, which is measured by a team of supervisors who rate each of these workers on a scale from 0, which means they're a horrible worker, all the way up to 20, meaning they're outstanding. So look at each one of these. The first person's score was a 70 on this major field test, the XYZ test, and their performance was a horrible 6. The last person they looked at got a 46 on the XYZ major field test, but their performance was a mediocre 10. See, one person actually got a 20, and their score was a 90 on this field test. Anyway, this is the input data, so you have the X and the Y. We have the sum of the X on the bottom and the sum of the Y. Well, if you decide to do this by hand, there are all the sums you need: sum of the X, sum of the Y, sum of the XY, sum of the X squared, sum of the Y squared. And so the next step is we get the R, and by now you're familiar with the formula. Notice N is 16; that's why you have 16 there. You just follow these steps in the formula, which we're not asking you to memorize, and you end up with a positive R of plus 0.80. Again, you have to test it for significance; we'll leave that for the F test. Okay, R squared, the coefficient of determination, is 64%: 0.8 squared. So 64% of the variation in job performance is explained by using that XYZ test, and the rest we just left unexplained. An industrial psychologist may say, let's add another variable, let's add X2, maybe even X3. We're just learning simple regression; we just need one variable, X1, and that's all we're using. The slope term: again, a lot of these numbers are already there, since we did them for the correlation. So we see that b1 is 11,328 over 47,552 (remember, these are numbers we calculated before from the sums), and it turns out the slope is 0.238. It's positive again: b1 is
positive and R is positive; they have to have the same sign. So b1 is a positive 0.238. For b0, use the formula you've been taught, Y bar minus b1 times X bar, and you end up with a value of minus 2.85. And now we write out the regression equation: Y hat equals minus 2.85 plus 0.238 X. Remember, you have to remember what the X's and the Y's are. We're trying to predict job performance; the Y is job performance. Now we can look at the Excel printout. Keep in mind that when you do it in Excel and you do it by hand, the numbers may be slightly different, because Excel uses a lot of significant digits, a lot of decimal places, maybe even 10, and you're rounding to 2 or 3. So the answers will be slightly different. Of course Excel is much more accurate than what you did by hand, but it's close enough. The first thing to look at is: is the regression significant? We're going to do this on the next slide too. Look at that value, 0.00018. That's a lot less than 0.05. Whether you're testing at 0.05 or at 0.01, it's less than that; it's even less than 0.001. It's a very low number, telling you the regression is significant. The R is called Multiple R, and it's close to the 0.80 you got by hand; a little bit off, but close enough. It's not really a multiple R. This program is also used for multiple regression, so they call it Multiple R, but this is just our lowercase r, the correlation coefficient: 0.802. R squared is 0.643, close enough to what we got, a little more than 64%. And of course there's the regression equation: notice b0 is minus 2.86675 etc. and b1 is 0.238. Now keep in mind, if b1 is positive, that tells you R is positive. If the slope were negative (here it's not), R would be negative. We're going to explain all this on the next slide, but keep in mind that you always check the slope before you answer the question of what R is. You've got the R on the printout, but it could really be a negative number. How do you know? You look at whether the slope is negative. We're going to do this problem now with MS Excel, and keep in mind that when you do it by hand you're
rounding. The computer does a much better job; it's using at least nine significant digits, so the numbers may be slightly off. Of course the computer got it right; you round more than the computer does, so don't be shocked if the numbers are slightly off. In the real world you're not going to do this by hand, but this is for teaching purposes. There's the printout. You see the R; actually, we weren't far off. It's 0.802, and you might even round it to 0.80. But before you decide whether it's a plus or a minus, you look at the slope term. The slope is positive, so R is positive. Also bear in mind that since this program is used for multiple regression as well, it's called the Multiple R. This is really just R; you only had one X, so this is just the same as R: 0.80233, the same as the 0.80 we got. R squared is close to 64%; the computer gave us 0.6437. The adjusted R squared, don't worry about; that's a mathematical adjustment if you want to be a little more precise, and we're going to ignore it. The standard error is actually the square root of the mean square error. It's used a lot in predictions, for the margin of error, and in doing various tests. We do that in the next course, not in this course. The most important thing is whether the regression is significant, and for that I look at the Significance F. Now the F value: F is a distribution like Z or t, and it's actually related to t. Notice the F value is 25.298, with degrees of freedom of 1 and 14: 1 in the numerator, 14 in the denominator. From that the computer can determine the significance, and it tells you it's 0.0001-something, definitely a lot less than 0.05, so the regression is significant. Again, if that value, the significance, is less than 0.05, it's significant. Now, as far as the regression equation goes, the truth is I should have told you this: you always look at the significance first, because if it's not significant then you've got garbage. You don't talk about anything. You don't talk about R, because R is not different from 0. So really you
should look at the regression first. These are the answers to the questions that I could usually ask when you look at the Excel printout. Again, the first thing you've got to keep in mind: the values you get from Excel and what you got by hand are usually not exactly the same. They'll be close. You'll see that we're slightly off on the value of b0, because when you do it by hand you're working with 2 or 3 decimal places. The computer is working with maybe 10 decimal places, and if you have a real supercomputer you might be using 20 decimal places. So Excel is more accurate than what you did by hand; don't worry about these slight deviations between what you did and what the computer is showing you. In the real world you're not going to be doing it by hand. The first question: is the regression significant? Yes. You saw that value there; it was a lot of zeros, definitely less than 0.05, and even if you're testing at an alpha of 0.01, it's less than that too. Basically, the question is this: the sample evidence could be represented by the R value or the R squared; is that sample evidence what you'd expect to see when X and Y are unrelated? Well, you don't expect to see such a high R squared, or that kind of scatter plot if you drew one. It doesn't look like a random pattern to me; if you plotted this, you'd see that. In any case, just by looking at the probability you can see whether the regression is significant, and the answer is yes. What was the value of b0? Well, the printout showed minus 2.87. When we did it by hand we were slightly off; don't worry about that. The b1 is 0.238. We have the regression equation: Y hat, which again represents job performance, the thing you're trying to predict, is minus 2.87 plus 0.238 X1. The value of minus 2.87 makes little sense in the real world, because it means that if your test score was 0, we'd be predicting a job performance of negative 2.87. The scale that was used by the panel was 0 to 20,
so there's no way of getting a negative number. So this one doesn't make too much sense, but perhaps it would if you had a bigger sample. Remember, these are just estimates: b0 and b1 are estimates of the true beta 0 and beta 1, the parameters. But this happens all the time; you might get a number that doesn't make too much sense for the intercept term. So we'll use minus 2.87 just for the equation. The slope term is important: 0.238. For every point on that XYZ test score, every 1 point you go up, you go up by 0.238 in terms of your performance score. It might read better with 10 points: for every 10 points on your XYZ test, you go up by 2.38 in terms of how the panel of judges evaluates your performance. Basically, the main thing is there's a relationship, so we can use this for predicting. We'll do that in a moment. The R, as we said, was 0.8023. Before you decide whether it's positive or negative, look at the slope. The slope was positive, plus 0.238, so R is positive. And the proportion of the variation in Y explained by X, that's called the coefficient of determination, is 0.6437: about 64%, a little more than 64%, which means that approximately 36% is left unexplained. And finally, what job performance would you expect from somebody with a test score of 65? Plug 65 into the equation and do the arithmetic, and you get your prediction: around 12.6, give or take a little. So remember, that 12.6 is on a scale that goes from 0, the most awful performance in the world, all the way up to 20. It's around 12.6. Here are some of the terms that you do see when you use Microsoft Excel for regression. You will see them also in any statistical package that you use for regression. In regression you can, as you know, separate the variation. We looked at this: the total variation can be split into the variation due to regression and the variation due to error, which is also called the residuals. The term SS means sum of squares
because, for the same reason as before, we square all of these deviations; otherwise they would add up to zero, so we use the sum of squares. Sometimes we take the sum of squares and divide by the degrees of freedom to get a mean square. All of this is very, very much like what we do in order to get a standard deviation: a sum of squares divided by its degrees of freedom is like the variance, and the square root of that is like the standard deviation. In this case, the degrees of freedom for the regression is 1, the total degrees of freedom is always n minus 1, and so what's left for the residual, the error, is n minus 2. SSR is the sum of squares due to regression. SSE is the sum of squares due to random variation, the error; it's also sometimes called the sum of squares residual, which might be a little confusing since regression has an R and residual has an R. SST is the sum of squares total. If you take any of those sums of squares and divide by its degrees of freedom, you get the mean square: mean square regression, mean square error. And if you take the mean square regression divided by the mean square error, that's the F ratio. That's the statistic that's used for testing the significance of the regression: whether it really shows a linear relationship between X and Y, or whether it's just not strong enough to be meaningful at all. In addition, if you take the sum of squares regression and divide by the sum of squares total, you're getting the proportion of the total variation in Y that's explainable by X, in other words by the regression, since we're only doing simple regression with one X. And this is R squared, which we also got by taking the correlation coefficient R and squaring it. So this is just to help you navigate around the Excel output. In this slide we'll explain a little bit more about the Excel output. What is the F ratio? We look at the sum of squares regression divided by its degrees of freedom, and that's called the mean square regression. A mean square is just a sum of squares divided by its degrees
of freedom. And if you take the sum of squares error, the SSE, the thing that we minimize mathematically, and divide it by its degrees of freedom, you have the mean square residual. It can be shown mathematically that the ratio of the mean square regression over the mean square residual results in an F ratio. Again, F is just another distribution, like Z or t. All right, now, if X and Y are not related, you'll get an F ratio somewhere between 0 and 1, maybe close to 1. If you just take a bunch of random numbers and pretend they're related (we've done this a few times; I've played with that), usually I get something between 0 and 1. It might be a little bit more than 1, but generally nothing's going on; it's just random numbers, and you're going to get something close to 1, maybe a little bit lower. Okay, so F ratios between 0 and 1 are not going to be statistically significant; not generally, always. Now, if all the points are on a line, then you've explained everything; there's no unexplained variation. Then with the mean square regression over the mean square error, you're dividing by 0. Remember, if all the points are on the line there are no deviations, no residuals, so SSE is 0; divide by the degrees of freedom and you still get 0. So the explained over the unexplained is the explained over 0, and you actually get infinity. Now the computer will not give you infinity; that would mean the computer working forever. I've tried a couple of problems like this in classes, and guess what happens: the computer stops at some point, with a huge, stupendous number. It switches to scientific notation and stops; otherwise it would burn out the computer. All right, so if all the points are on the line, the F ratio is going to be a stupendous number, close to infinity. Now say your F value is 30. Just keep in mind that if nothing is going on, your F value should be something near 1. Maybe even 0; that won't happen, but something between 0 and 1. If you get an F value of 30, that means your explained is 30 times greater than the unexplained, and that's
generally not going to happen by chance. So that's why your F value is significant. But all you've got to do is look at the significance, so don't worry too much about what the F is. Just keep in mind that it's the ratio of the explained to the unexplained, so the higher it is, the more you're explaining relative to what you're leaving unexplained. So let's say the F ratio is 100: that means you explained 100 times more than you left unexplained. Remember, this is always the basis of the ANOVA table: the explained (explained by X) plus the unexplained is the total variation in Y. Now I'm going to start doing problems using only Excel. This is the real world. This one is about how many years of education you have and your income. We start with somebody who took a sample of people, and there were 12; see, observations is 12, so 12 people. The first one had an education of 9 years, and we have his income in thousands of dollars, which is 20, meaning 20,000 dollars a year, not too much. And we had two people who had 20 years of education; one of them had an income of 43,000 and the other one had 70,000. We're representing 43,000 by 43; just keep in mind that income is in thousands. And there's your Excel printout. The first thing I notice is that it's significant. Remember, that's the first thing you look at. How do I know it's significant?
First of all, the F ratio is 28.6. We did a lot of explaining relative to unexplaining: 28.6 times more explaining than unexplaining. And more importantly, the Significance F is 0.0003, a lot less than 0.05 or 0.01, which is generally what we test at. That's the way I know that we have a significant regression. And here's the regression equation; we're going to have it on the next slide. Y hat, which is income, the dependent variable, has an intercept of minus 11.02, and I see 3.197 is the slope term. So the intercept, again, was minus 11.02, and the slope term is, let's call it, 3.20. The R is 0.86, and it's plus because the slope is plus. R squared is 0.74. I think that's enough, and we'll do more on the next slide. The regression equation is Y hat equals minus 11.02 plus 3.20 X; we're rounding. Or if you want to write it out, and this is better: income equals minus 11.02 plus 3.20 times years of education. That way we don't confuse the dependent and the independent variable. Since b0 is minus 11.02, in theory if you had zero years of education you'd make minus 11.02 times a thousand, negative 11,020 dollars: a negative income, which probably indicates that your family is helping you or you're on welfare. So somebody who has no education would, according to the equation, be making negative income. And the slope b1 of 3.20, what does that teach us?
Every year of education increases your income by 3.20 times a thousand, or 3,200 dollars. Every year of education is worth 3,200 dollars. The correlation R is positive; it's 0.86. It had better be positive: more education, you make more income. As a teacher, we hope it's positive. So R is 0.86, which is strong; it can go from minus 1 to plus 1, and it's plus. The R squared, which is easier to explain, is approximately 74%. That tells us that education explains 74% of the variation in income; 26% is due to other factors. That means there will be exceptions, and that's why I'm not impressed when somebody says, well, I know somebody who never even went to elementary school and is making $10 billion, and he was President of the United States. That doesn't prove anything, because we're not saying it's a perfect relationship. There will be exceptions, and that's the unexplained part; those are the exceptions that everyone tells you about. I know a person who smoked 5 packs of cigarettes a day and lived to the age of 100. Sure, there are going to be exceptions. There won't be exceptions only if R squared is 1, which means R is 1 or negative 1. If R squared is 100%, no exceptions. If you look at the R squared, you can take the sum of squares regression over the sum of squares total, and if you do that ratio it will be exactly the same: 0.741, which we rounded to 74%. These are the minor points in using MS Excel. The mean square error, which Excel calls the mean square residual, is 61.0969. The square root of that is 7.81645; that's called the standard error of estimate, and it's used for confidence intervals. In any case, you'll need this for future courses. The F ratio is used to test the hypothesis that there's no regression, that X doesn't explain Y; that's why we need that F ratio. The regression is very significant, with an F value of 28.61. As I explained, you explained 28 times more than you left unexplained, and if nothing is going on you should get an F value of roughly 0 to 1. Certainly once you get an F of 28, it's going to be significant. You explained a lot, and
the probability of getting the sample evidence, or an even stronger relationship, if X and Y are unrelated, that is, if H0 is true, is that probability: 0.0003. In other words, it's almost impossible to get this kind of data. Remember, the data is represented by, let's say, the R squared, or the scatter plot, or the R. If nothing is going on, you shouldn't be seeing an R squared this high. The R squared is too high to occur if nothing is going on; it's too strong a relationship to occur by chance. That's what that probability is showing you. The H0 is that nothing is going on, and the question is whether the sample evidence supports this. The answer is that the sample evidence is not supporting it. If nothing were going on, you wouldn't see this; the sample evidence, measured by the R, the R squared, the scatter plot, or whatever, is telling you something is going on. That's why we reject H0. When it comes to simple regression, three different tests give you essentially the same results. You can look at the F test for the regression; that's the method that we've selected and we're using. But you can also test the R, the correlation coefficient, for significance, with n minus 2 degrees of freedom. Or you can test the slope, which you'll see on the Excel printout; by the way, that's the same as testing R. But again, this is all in simple regression. So you have three ways to test the significance. You're basically trying to see if X has an effect on Y, whether they're related, whether there's any kind of relationship. Now look at the t test; I'm just going to talk about that part of the printout rather than ask you to do the calculations. The hypothesis is that beta 1 equals 0, that there's no slope. X affects Y through the slope, so if the slope is 0, it means X has no effect on Y. So that's the same as testing the simple regression for significance, and the same as testing R. As you know, R and the slope term are related; they always have the same sign. In the previous problem, the t
value (this is for education and income, and this is where you're doing the test on the slope, where H0 is that the slope is 0) came out to 5.348776972, and that's extremely significant; the probability attached to it is 0.000324168. You also get the same t value if you do the t test on the correlation, where H0 is that rho, the population correlation, is 0. You get exactly the same t value, and again the probability is also the same as the p-value for the F test. Mathematically, the two tests give the same result; in fact, F equals t squared, if you want to know. But in any case, there are three ways to test. We've shown you how to do it with the F value, because when you do multiple regression you're going to be using the F test. But it's good to know there's a t test for the slope, and that t test, in the simple case, is the same as looking at the F test. That's why you get the same probability. So testing the b1 term in simple regression is equivalent to testing the entire regression when you only have one X variable. As it says, after all, there's only one X variable in simple regression. In multiple regression, where we have a lot of X's, you can do individual tests on each of the slopes. With b1 X1, you can test that one; then there'll be a b2 X2, and you can test that; you have a b3 X3; so you can test the individual variables individually. The F test will then be for the entire regression. So for example, if you have five X variables, five independent variables, you can do one test on the whole thing with all five, or you can look at the individual tests on the slopes: the b1 slope, the b2, the b3, the b4 and the b5. Here we're going to use the equation to predict. How much income would you predict for an individual with 18 years of education? Plug the 18 into the equation: income equals minus 11.02 plus 3.20 times 18. That works out to 46.58 in thousands, so your predicted income if you have 18 years of education is 46,580 dollars. Of course there's a margin of error, and that's why you want that standard
error of estimate. When you take the more advanced courses in regression, they teach you about this margin of error, and you'd be using that standard error of estimate. Since it's not done in this course, we're not going to learn it, but just keep in mind your answer isn't really exactly 46,580; it's going to be plus or minus that margin of error. In this example we're looking at hours spent gaming and whether it's related to high school average. The important thing here is high school average, because that's the variable we want to study. The variable we want to study we always call Y, and we want to look at what might have an effect on the Y: why aren't all the high school averages exactly the same? In this example, the researcher is studying the number of hours spent gaming. 22 students were randomly selected. You see the data on the right side of the screen: X is the number of hours spent gaming, and Y is high school average. I'm going to leave it to you to answer the question: why did we call gaming X and why did we call average Y? If you can't figure that out, you might want to go back to one of the earlier problems in this lecture. The first thing the researcher looks at is whether the regression is significant or not. See where it says Significance F, and note that the value is 2.55559 E minus 7. That's called scientific notation, and it means move the decimal point 7 places to the left. So our significance level is 0.000 and so on; anyway, you see it up there. It's highly significant. Basically, this is not chance. What we're looking at cannot be explained by sampling error, which is another word for chance. And again, look at the F value (F is like the t value; it's related to t): you're explaining 57.7 times more than you leave unexplained, and this is not what happens through randomness. Something is going on. So now we know we have a significant regression, and now we have a right to look at the coefficients. Notice the intercept term is 84.045; we're
going to round that soon. And the slope, this is important: we have a negative slope, minus 5.26. A negative slope tells us immediately there's an inverse relationship: when X goes up, Y goes down. And notice the R; we're going to talk about this on the next slide. Even though Excel says the R is 0.86, you have to know that it's a negative 0.86. Again, if the slope is negative, R must be negative. Do not forget that fact. Some programs, of course, will show a negative R, because R goes from minus 1 to plus 1. We have a negative 0.86 correlation coefficient. The R squared is about 74.275%, or, rounding it, the X variable explains 74.3% of the variation in the Y variable. The rest we'll explain on the next slide. Okay, here are some questions that can be asked based on the Excel output you just saw. Is the regression significant? Well, you already know that; the other Professor Friedman made it very clear on the previous slide. What's the value of b0, the Y-intercept? 84.05. What's b1, the slope? Here you know it's a negative slope: negative 5.26. Once you put those in, what's the regression equation? Well, you can write it out: Y hat is equal to 84.05 minus 5.26 times X. That's the regression equation. From the printout we also saw that the correlation coefficient is negative 0.86. It's an inverse relationship; you know that because the slope is negative. The X and the Y work in opposite directions. Excel has a little problem with giving you a negative correlation coefficient: it's always going to look positive. So you always have to make sure you take a look at the slope, b1, before you answer a question about R, the correlation coefficient. Excel thinks that you know this and that we're all geniuses. What's the percentage of variation in Y that's explained by X? Remember, we're studying Y, and we're using X to explain variation in Y. Well, that's just the definition of R squared, the coefficient of determination, and that's 74.28%, again just reading it off the output. What high school average would you
expect for someone who spent 2.5 hours per day gaming? Just plug that value of X into the regression equation, and when you do that you end up with a predicted high school average of 70.9. In this next example, the researcher is interested in determining whether or not there's a relationship between ounces of alcohol consumed daily and workplace performance. A lot of companies have some kind of performance measure. This company is measuring it on a 10-point scale, where 10 means you're fabulous and 0 means you're worthless as an employee and they're probably ready to fire you. So now you decide which is the X and which is the Y. Obviously the Y variable, known as the dependent variable, is performance. We're trying to predict performance; that's the thing you want to predict. And we're using alcohol as the independent variable. We took 14 employees, selected randomly, and somehow we got an honest answer about alcohol consumption; it's not so easy to get honest answers to that question. We also have the performance ratings. Notice that the first two employees had very good ratings of 9 and 10. The last two had ratings of 3 and 2, which are pretty low on a scale that goes from 0 to 10. The next slide will show you the Excel printout. Here you go. The first thing we look at, always, always, is whether this regression is significant, because if it's not significant we might as well just throw up our hands and move on. But this is significant. There's definitely a significant relationship between X and Y; alcohol has some effect on the performance scores. And does it have a positive effect or a negative effect? Well, you wouldn't know that just by looking at the R on the output from Excel. So first let's look at the coefficients: b0 is 7.48 and b1 is negative 0.1. So yes, there's a negative slope; there's a negative relationship. The more alcohol, the lower the performance. And R is negative 0.81, which is fairly high, and R squared is 0.6609,
which means that a little bit more than 66% of the variation in performance can be explained by alcohol consumption. First question: is the regression significant? Well, we saw the Significance F, and it was a decimal point and a couple of zeros there, so yes, it's significant. What is the value of b0, which is called the Y-intercept? 7.48; it's on the printout. The b1 slope term was a negative 0.10, rounded. We write out the regression equation: Y hat is 7.48 minus 0.10 Xi, and remember, Xi is the alcohol consumption. Okay, what is the correlation coefficient? This is important, and we emphasized it on the previous slide: when the slope is negative, R is negative. So R is minus 0.81. What is the percentage of the variation in Y explained by X? That's known as R squared, the coefficient of determination, and that was 66.09%, so we did a good job of explaining the variation in Y. And finally, what performance would you expect from somebody who consumes 10 ounces of alcohol daily? You plug the 10 into the equation. So you have Y hat, which is the performance measure you're predicting, now with X equal to 10: 7.48 minus 0.10 times 10, and that works out to 6.48. That's the performance score that you predict for somebody who consumes 10 ounces of alcohol. The secret to learning this material is not a very big secret. I've been pushing this after every lecture: we do problems, problems, problems; practice, practice, practice. Do your homework. If you don't have an instructor for this course but you're learning it on your own, go to our homeworks page and go to the handouts page; there are lots of review problems. Do the problems in this very lecture. The next lecture up is a review session on regression. Specifically, it looks at using Microsoft Excel for regression, but the data is there, so you can certainly take the problems and use your calculators and also do them with Excel, and you can compare the results. Thank you very much for joining us in this lecture.
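For anyone who wants to check hand calculations against something other than Excel, here is a minimal Python sketch of the quantities read off the printouts in this lecture: b0, b1, the signed R, R squared, the sums of squares, the mean squares, the F ratio, the standard error of estimate, and the t value for the slope. The function name `simple_regression` and the dictionary keys are this sketch's own choices, not anything Excel uses. The gaming numbers below are made up purely for illustration; they are not the data from the slides. The sketch does not compute Significance F, since that needs the F distribution, which the Python standard library does not provide.

```python
import math


def simple_regression(x, y):
    """Least-squares fit of Y hat = b0 + b1 * X, plus the summary values
    discussed in the lecture. Key names are this sketch's own."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)

    b1 = sxy / sxx              # slope
    b0 = y_bar - b1 * x_bar     # intercept: Y bar minus b1 times X bar
    y_hat = [b0 + b1 * xi for xi in x]

    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)          # explained
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained
    sst = sum((yi - y_bar) ** 2 for yi in y)               # total

    msr = ssr / 1               # regression has 1 degree of freedom
    mse = sse / (n - 2)         # residual has n - 2 degrees of freedom
    # If every point is on the line, SSE is 0 and the ratio is infinite.
    f_ratio = msr / mse if mse > 0 else math.inf

    r_squared = ssr / sst
    # Excel's "Multiple R" is unsigned; R takes the sign of the slope.
    r = math.copysign(math.sqrt(r_squared), b1)

    se_b1 = math.sqrt(mse / sxx)
    t_slope = b1 / se_b1 if se_b1 > 0 else math.copysign(math.inf, b1)

    return {
        "b0": b0, "b1": b1, "r": r, "r_squared": r_squared,
        "ssr": ssr, "sse": sse, "sst": sst,
        "msr": msr, "mse": mse, "f_ratio": f_ratio,
        "std_error": math.sqrt(mse), "t_slope": t_slope,
    }


# The five students from the start of the lecture: every point on the line.
perfect = simple_regression([1, 2, 3, 4, 5], [40, 50, 60, 70, 80])

# Made-up illustration data (NOT the slide data): hours gaming vs. average.
hours = [1, 2, 3, 4, 5, 6]
average = [82, 78, 75, 70, 64, 60]
gaming = simple_regression(hours, average)

for name, fit in (("perfect line", perfect), ("made-up gaming", gaming)):
    print(name, {key: round(value, 4) for key, value in fit.items()})
```

On the perfect-line data the residuals are all zero, so the F ratio comes out infinite, which matches the "stupendous number" point made earlier. On the made-up gaming data the slope is negative, so the sketch reports a negative R, which is the check Excel leaves to you.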