Welcome to this review session. In this session, we're going to work through some regression problems, solve them using Microsoft Excel, and explain how to interpret the printout. We'll be looking at several problems over the next few slides. Pay attention, and do listen to the audio. I hope you enjoy this lecture.

Look at this problem. Again, in regression, you have to know which is the X variable, the independent variable, and which is the Y variable, the dependent variable, the thing you're trying to predict. Here the X variable is a math score: prospective employees were given a math test, and we have their scores. We also have them rated on job performance on a 0 to 10 scale, where 10 means you're an incredibly good employee and 0 means you're awful. This printout comes from Excel. Again, we have a handout on how to do this: how to use Excel to get a scatter plot, which you have on the side, and how to get this regression printout. You're going to have to understand how to read a printout. The first thing you look at is the significance value. In this case, there's an arrow pointing to it. It shows that the probability of getting this kind of sample evidence, given that X and Y are not related and you shouldn't be using one to predict the other, is 0.765. Now, the rule is, if you're testing at the 0.05 level, this is not significant. Basically, what you're learning from this printout is that this result is not unexpected. If X and Y have nothing to do with each other, this is the kind of sample evidence you'd expect to see. Indeed, look at the scatter plot on the side, the one with the chart title. I had the computer draw in the regression line, but it's meaningless because clearly this is just a random pattern. Your sample evidence is showing a random pattern. X and Y look totally unrelated. It's almost like a circle. You can draw that line anywhere.
The computer thought this was the best fit for the least squares line, but really there's no good line. The first thing you always look at is the significance value. Rule of thumb: if the F value is between 0 and 1, it's never going to be significant. An F of 0 is totally not significant; you've explained nothing. The X does not explain any of the variation in Y. Right away, when you see this, you know it's not significant. You don't write down the regression equation. You do nothing. You've got garbage. X and Y are unrelated. So in this case, you'd tell your boss that you did a study with 13 subjects and found no relationship between math scores and job performance. And again, that's confirmed by the R. Look at the correlation coefficient. The printout calls it multiple R, because the same printout is also used when you have several independent variables, but here it's really just R, the correlation coefficient. Notice it's 0.09. Quite close to 0. In any case, it's not significantly different from 0. That's what not significant means. You cannot reject H0, which says that rho, the population correlation, is 0. This is basically no different from 0. There is no relationship between X and Y. You should not be using X to predict Y. You have no connection between these two variables.

Okay, now we're looking at another problem. X, our independent variable, is years of education: how many years of education somebody has. A high school graduate, for example, has 12 years of education. And we'll look at their hourly wage at this company. We want to see whether there's some kind of connection between years of education and hourly wage. The first thing you look at is the significance of F. And as you can see, it's way below 0.05, right? It shows as 0.0001 or less, so at most about one chance in 10,000 of getting this kind of sample evidence if X and Y are unrelated. Now, the sample evidence is either that scatter plot or the R value.
But basically, the minute you see this significance value, you know that X can be used to predict Y. Now, first of all, let's write out the regression equation. You see where it says coefficients? And you see the intercept. The intercept is the constant. That's the value of Y when X is zero. So your intercept is negative 17.937, and your slope term is positive, 3.349. You want to note that it's positive. So the way you write out the regression equation is Y hat equals minus 17.937 plus 3.349X, where X is years of education. Suppose I ask you to predict, and let's say X is 16. Plug 16 into the equation: 3.349 times 16 is 53.584, and minus 17.937 that gives about 35.65. That's Y hat, your predicted hourly wage. And that equation is the regression line.

Now, other things of interest. You can tell your boss that the correlation coefficient, R, is a positive 0.868. That's a very strong correlation; we use the word strong for a value like 0.868. Besides being significant, which means it's not zero, you can say it's strong. The R squared is the proportion of the variation in Y, which is hourly wage, explained by years of education. It's 75.4%. That's a lot of explaining. Let's just call it 75%. That means only 25% is left unexplained. Now, that could be random factors, or you may have other factors. Education may not be the only factor, but education did a very good job of explaining the variation in Y, the hourly wage. The adjusted R squared is a mathematical adjustment for degrees of freedom; ignore it. We'll work with the regular R squared. R squared is just R, the correlation coefficient, squared, and it has a lot of meaning in regression. The standard error we're not going to use much. It's used for all kinds of inferences that we're not going to do in this course. Observations: there are 20 observations. You can count them. There are 20 points.
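As a quick check on the prediction for X = 16, here is a minimal Python sketch that plugs a value into the fitted line. Only the two coefficients come from the printout; the function name `predict_wage` is a hypothetical helper introduced for illustration, not something Excel produces.

```python
# Coefficients read off the Excel printout for the education/wage problem.
INTERCEPT = -17.937  # b0: predicted hourly wage when years of education is 0
SLOPE = 3.349        # b1: change in predicted hourly wage per extra year


def predict_wage(years_of_education):
    """Return y-hat = b0 + b1 * x for the fitted line (hypothetical helper)."""
    return INTERCEPT + SLOPE * years_of_education


if __name__ == "__main__":
    y_hat = predict_wage(16)
    print(f"Predicted hourly wage at 16 years of education: {y_hat:.2f}")
```

Calling `predict_wage(16)` gives roughly 35.65, the same figure you get by hand from 3.349 times 16 minus 17.937.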
You're looking at essentially 20 people, with their years of education and hourly wage. The only thing you have to realize is that step one is to look at the significance value. If it's significant, you can write out the regression equation. You write it out, and I told you what it was: Y hat equals minus 17.937 plus 3.349X. You can use that for predictions. I showed you how to do that. You explain what R is to your boss. It's a measure of association, and it's plus 0.868. Your R squared is another important measure: you tell your boss that education explained approximately 75% of the variation in wage, with only 25% left unexplained. And now you know how to read a printout.

Let's look at this problem. X, the independent variable, is age. The dependent variable, Y, is task completion time in minutes. You have the X and the Y. First thing you decide: is the regression significant? There's an arrow pointing at the significance of F. It shows as 0.0001 or less, so at most about one chance in 10,000 of getting this kind of sample evidence if X and Y are unrelated. Definitely, definitely significant. Definitely less than 0.05. This is not what you'd expect to see if nothing were going on. And again, the scatter plot would confirm that. What is the value of the intercept term, B0? You can see it on the bottom, 11.525. That's the intercept coefficient. So B0 is 11.525. What is the slope term, B1? 1.032, right? Now write out the regression equation based on that: Y hat equals 11.525 plus 1.032X. Notice the positive slope, which means the correlation is also going to be positive. What is the correlation coefficient? Plus 0.868. The correlation coefficient always has the same sign as the slope. There's a positive relationship between age and how long it takes to do the task. The older you get, the longer it takes to do the task. What proportion of the variation in Y, task completion time, is explained by X? How much does age explain?
Well, you can see from the R squared value, 75.3%. Let's round it to approximately 75%. Again, 25% is left unexplained. Other variables, perhaps. But 25% is unexplained and 75% is explained. So again, you tell your boss, I found a significant relationship between age and task completion time. And notice what happens. Every year someone gets older, it takes another minute to complete the task, 1.032 minutes to be exact. That's the slope: the change in Y, task completion time, per one-year change in age. Every year means another 1.03 minutes to complete this task. If you're asked to predict, let's say for somebody who is 50, how long it would take based on your regression model, plug 50 into the equation. 50 times 1.032 is 51.6, plus the intercept, 11.525, gives about 63.1 minutes.

Okay, let's look at this regression now. Age is X, the independent variable. The dependent variable is job satisfaction score. Are they related? Question one: is the relationship here significant? The answer is to look at the significance of F: 0.0004. That's a lot less than 0.05, and we're testing at the 0.05 level. Yes, it's significant, highly significant. In fact, there are four chances in 10,000 of getting this kind of sample evidence, meaning the scatter plot or the R, if nothing is going on. This is not what you'd expect to see. Okay, so we have a significant regression. There's a relationship between age and job satisfaction. What's the value of the constant, B0? 109.27. We've rounded a little bit. What is the value of the slope term? Now, this is important. It's negative 1.0448. There's a negative relationship. The slope is negative, so the correlation coefficient is negative. Let's write out the regression equation. Y hat is 109.27 minus 1.045X, where 1.045 is just the slope rounded. What's the correlation coefficient? Now, this is a trick question. The printout doesn't show the negative sign on multiple R; that's just how Excel reports it.
You're supposed to figure out that R is minus 0.792. Minus 0.792 means a negative relationship. As people get older in this company, their job satisfaction goes down. In fact, I can tell you how much it goes down. For every year they get older, job satisfaction goes down by 1.045 points. A little more than a point every year on a 100-point scale. Okay, what is the proportion of the variation in job satisfaction, the Y, explained by age? The R squared tells you it's 62.7%, 0.627. We explained not quite 63%, which means a bit more than 37% has been left unexplained. So you might want to use another variable, and that would be called multiple regression, but not in this course.

Okay, let's look at this problem. We'll look at absences from class and the score on the stat final. As you can see, there were 17 observations. So we're looking at 17 points, 17 people. And I can get the scatter plot. You know how to do that. All right, first thing: is the regression significant? Look at the value there, 0.0030. Three chances in a thousand of getting this kind of sample evidence. That's a lot less than 5%. It is significant. We have a significant regression. So there's a relationship between absences and the score on the stat final, and we're going to use absences to predict the score. Okay, question two. What are the values of B0 and B1? The intercept term, the constant, is 88.75 after rounding. The slope term again is negative, minus 4.40. So we write out the equation. Y hat, which represents the score on the stat final, is 88.75 minus 4.40 times X. If you want to predict the score for somebody who misses class eight times, plug eight into the equation: 88.75 minus 4.40 times eight, and your prediction is that they get a 53.55 on the exam. Just make sure, again, you notice that the slope is negative. So what happens to your correlation coefficient?
It's going to be negative 0.6741. Again, the correlation coefficient and the slope have the same sign. There's an inverse relationship between absences and the score on the stat final. So again, make sure you know that R is a negative 0.674. Okay, what is the proportion of the variation in Y that is explained by X? X did a pretty good job, but not a perfect one. It explained 45.4%, or 45.44% if you want to be more exact. A bit more than 45% was explained, which means close to 55% was unexplained. You might need some more variables to explain the rest. Now, one other question. Look at the intercept. I forgot to mention this. What is that? That's the value of Y when X is zero. X is zero means no absences. If you never miss class, according to the model here, your predicted grade is 88.75. Again, the intercept, the constant term, is the value of Y, in this case the score on the stat final, when X is zero, meaning zero absences. So we'd be predicting 88.75. Every absence makes you lose 4.4 points on your stat final.

Okay, let's look at this problem. Look at attractiveness on a 10-point scale, where a 10 means you're very, very good looking. And we want to see if it's related to wage. This could be some kind of discrimination if there is a relationship. So the company wants to know: is there a relationship? Are more attractive people getting higher wages? And what is your conclusion? Look at that significance of F, 0.283. Again, it's not significant. That's a lot more than 5%. So the pattern you're looking at could very well be explained by chance. And the correlation coefficient, even though technically R is 0.268, is not significantly different from zero. You would say it's not significant, because you don't have evidence that it's different from zero. And the same with the R squared. Look at the R squared, 0.072.
You really haven't explained much; statistically, you haven't explained more than zero. Again, when you take one look at this, you just tell your boss there's no relationship between attractiveness and wage. No relationship, period. Don't use one to predict the other. They're not related.

Remember, as always, the best way to study for an exam in this statistics course is to do as many problems as possible. And that includes problems which you solve using Microsoft Excel.
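The reading procedure used throughout this session (check the significance of F first, then write the equation, give R the sign of the slope, and read R squared) can be sketched in Python. This is a rough illustration under stated assumptions, not part of the Excel handout: the names `read_printout` and `predict` are hypothetical, and the numbers are the ones quoted for the absences and attractiveness problems.

```python
def read_printout(significance_f, intercept, slope, multiple_r, alpha=0.05):
    """Summarize a simple-regression printout (hypothetical helper)."""
    # Step one: if the significance of F is not below alpha, stop here.
    if significance_f >= alpha:
        return {"significant": False}
    # Multiple R is printed without a sign, so r takes the sign of the slope.
    r = multiple_r if slope >= 0 else -multiple_r
    return {
        "significant": True,
        "intercept": intercept,        # b0: value of y-hat when x is 0
        "slope": slope,                # b1: change in y-hat per unit of x
        "r": r,                        # signed correlation coefficient
        "r_squared": multiple_r ** 2,  # proportion of variation explained
    }


def predict(summary, x):
    """Return y-hat = b0 + b1 * x, but only for a significant regression."""
    if not summary["significant"]:
        raise ValueError("Regression not significant; do not predict.")
    return summary["intercept"] + summary["slope"] * x


# Absences vs. stat final score, using the values quoted in the lecture.
absences = read_printout(0.0030, 88.75, -4.40, 0.6741)

# Attractiveness vs. wage: the coefficients were not quoted in the lecture
# and are not needed, because the check stops at the significance of F.
attractiveness = read_printout(0.283, None, None, 0.268)

if __name__ == "__main__":
    print(absences["r"], round(absences["r_squared"], 4))
    print(round(predict(absences, 8), 2))
    print(attractiveness["significant"])
```

Refusing to predict when the regression is not significant mirrors the rule from the first problem: if the significance value is too large, you don't write down the equation at all.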