Welcome to our lecture on correlation. In this course, we're going to be looking at linear correlation. The goal is to measure the strength of the linear relationship between two quantitative random variables. Each one has to be measured on at least an interval scale. Now, researchers are often trying to find out whether two variables are related. For example, you might be interested in looking at longevity, how long people live, and how many calories they consume per day. Or you might be interested in the amount of time spent on the internet, hours spent on the internet, and high school average. Or college GPA, for that matter. So anyway, there are lots of cases where you might be interested in just looking at the correlation. And you'll see in a moment how this works.
I'm going to show you a simple formula to compute R. R is the correlation coefficient, and R ranges from minus 1 to plus 1. If you manage to get an R of plus 1, that means it's a perfect positive linear relationship. In fact, when you plot it, you see all the points are on a straight line. An R of minus 1 indicates a perfect negative, or what's sometimes called inverse, linear relationship. And an R of 0 indicates there's no linear relationship between the two variables you're examining.
This R is a sample correlation, because you're taking a sample. It might be 50, 30, or 100, but it's a sample. You're not taking the correlation for the entire population. If you were going to do that, the population correlation coefficient is rho. That's the Greek R. It's called rho, and you can see it on the slide. Again, you can only compute rho if you take a whole census. So R, in effect, is an estimate of rho.
If you look at the top of the slide, you see there's a scatterplot which shows R is plus 1. You see all the points are on a straight line. You can draw the straight line. No point will be off the line.
That's an R of plus 1, which indicates a perfect positive linear relationship between the X and Y variables. Again, if you find that all the points are on a straight line and it has a positive slope, that R is plus 1, but that's generally not going to happen in the real world. You're more likely to see something like the scatterplot on the right. You can see it's linear, and it's positive, but R is not going to be plus 1. Some points will be above the line, and some points will be below it. So in the real world, you don't really see perfect relationships. But you might see a very strong positive relationship.
For example, look at hours studied and grades. I guarantee you, and I think everyone would admit, people who study more in general get higher grades. So it's going to be a significant positive relationship. But there are other variables as well. That's why you're not going to get an R of 1. It's not only studying that affects grades. Two students might each spend 20 hours studying for an exam, and one might get 100, and the other might get an 80. There are also random effects. There's always randomness. Sometimes you call it noise. There's noise in the system. And there are also variables you didn't take into account. If somebody has a high IQ, that may help, and they can get a high grade with less studying. Or they may have previous knowledge or other factors. Generally, you might want to use several variables. But in this course, we're doing simple linear correlation.
Now we'll look at negative relationships. If you get an R of minus 1, which again is not going to happen in the real world, you see all the points are on a straight line. You can see the scatterplot on top. All the points are on a line, and R is minus 1.
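As a rough illustration of the plus 1, minus 1, and in-between cases described so far, here is a minimal Python sketch. It is not part of the lecture: the function name pearson_r and the small data sets are made up for illustration, and the calculation uses the standard definition of the sample correlation coefficient rather than the formula on the slide.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient r for paired data (standard definition)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, chosen only to show the three situations.
hours = [1, 2, 3, 4, 5]
perfect_up = [2 * x + 1 for x in hours]      # every point on a rising line
perfect_down = [10 - 3 * x for x in hours]   # every point on a falling line
noisy_up = [3, 6, 6, 10, 10]                 # rising trend, some points off the line

r_up = pearson_r(hours, perfect_up)
r_down = pearson_r(hours, perfect_down)
r_noisy = pearson_r(hours, noisy_up)
print(round(r_up, 4), round(r_down, 4), round(r_noisy, 4))
```

The two perfect lines give an R of plus 1 and minus 1, while the noisy data gives a strong positive R that stays below 1, which is the situation the right-hand scatterplot describes.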
But again, in the real world, you're not going to see something like that, because that would be a perfect negative linear relationship, or what's called an inverse relationship. If you look at the scatterplot on the side, you see the points are definitely linear with a negative slope. Definitely. But some points are not going to be on the line. Many of them actually are, but some won't be. So what we see here is a very strong negative linear relationship, but R is not going to be minus 1.
What does an R of 0 mean? Again, R is generally not going to be exactly 0. What it means is there's no linear relationship between X and Y. In the real world, you very rarely see an R of 0, but you might see a low R. Now, look at the two scatterplots. In the one on the right, you see that, in a sense, there's some kind of relationship there. It's a perfect circle, but R is going to be 0 because there's no linear relationship. On the left, you see no relationship. There's nothing. It's just a bunch of random points. But if you take a bunch of random numbers and get the R, chances are it won't be exactly 0. It might be something like 0.02, 0.03, 0.04.
Once you take more advanced courses, or maybe even in this course, you'll learn how to test for significance. When you're testing for significance, you want to know: is the R statistically different from 0? Again, in the real world, you don't see an R of exactly 0. But you might see a very low R, which is basically not statistically different from 0.
Now, something that's very important. Once you test X and Y to see if the correlation is significant or not, say you decide that there's a significant correlation. Let's say the R is plus 0.8. It's pretty strong; it can't go higher than 1. So you have a strong correlation. Can we really say correlation implies causality? No, and here's why. There are four possible explanations when you find that X and Y are related.
One possibility is X causes Y. Another possibility is Y causes X. Another possibility is Z causes both X and Y. It's not X and Y that are directly related; it's a Z factor that affects both the X and the Y. And finally, it could be just a fluke. We call that spurious correlation.
So we have this problem all the time in research. People make the assumption that because two things are related, they can decide which causes which. Take poverty and crime. Which causes which? Is it that people are poor and that's why they commit crimes? Is it the other way around? People who commit crimes can't get jobs, which is true by the way. If you commit crimes, it's very hard to get a job, so that's why they're poor. We don't know. And maybe it's a third factor.
Here's another example. We know that a lot of older singles suffer from chronic depression. Which one causes which? Is it being single? You're single and you're older and you have no loved one, so that's why you're depressed. Or maybe it's the other way around. People who are depressed are going to be single. Nobody wants to date them. Do you want to marry somebody when all they talk about is committing suicide? I don't think that would be a great date.
How about this one? Cities with more cops also have more murders. Does the fact that you have more cops mean that it causes the murders? Or is it the other way around? Because there's a lot of homicide in your town, they're going to hire more cops. See, that's why it's very hard to determine. You shouldn't use correlation to prove causality. It usually works the other way around: if you don't see that two variables are related, then you might say there's no causality.
Here's another example of this. We know that people wear more clothing when it's colder. The amount of clothing, I guess you can weigh it. When it's really cold, people are wearing more clothing. So there's definitely a correlation between how much clothing you're wearing and the temperature. The more clothing, the lower the temperature.
When it's zero outside, we're all going to be wearing heavy coats. So clearly there's a relationship. Now, suppose we all agree that we want the temperature to go up. It's really cold outside. So we say, OK, here's how to make it go up. We're all going to wear very little clothing, because we know there's a relationship. We'll go out wearing just bathing suits, and we're going to change the weather. That's going to make the weather warmer. We all know that's crazy. But researchers, without knowing it, are doing that.
Here's another example: umbrellas. The number of umbrellas outside will increase when it's raining, especially in a strong rain. But we all know that even though there's a correlation between the number of umbrellas and rain, you can't make it rain by going outside with an umbrella. So just keep in mind that correlation does not prove causality.
If we square the correlation coefficient R, we get something called R squared, the coefficient of determination. Remember, R can go all the way to negative 1, and it can go all the way to positive 1. When you square it, you can never get a negative number. And in fact, what you have is a proportion that you can read as a percentage. The definition of R squared is the proportion of the variation in Y explained by X.
Now, typically with correlation alone, we're not interested in which variable is Y and which variable is X. This becomes much more important in regression. But we can look at it now anyway, because you know correlation, you know the correlation coefficient, and you know how to square a number. And you can always switch things around if you want and call X, Y and Y, X. However, let's imagine that we know what we're doing and that we're studying the variable Y. Like, why are the grades in my class variable? Why doesn't everyone get the same grade? I'd like to know that. And I'm trying to use X to explain that.
One example would be: how many hours did each student spend studying for the test? So I'm trying to look for explanations of the variability in Y. There will probably be more than one factor, but in this case, I'm only looking at one. All of that is to say, very simply, R squared has a meaning. It's easy to understand and easy to explain. It's the proportion of the variation in Y that's explained by X. And Y is what I'm interested in studying. I know that at this point we're not really making distinctions between X and Y. But we're going to move very, very quickly from correlation to regression, and you will want to know which variable is the independent variable, the X, and which one is the dependent variable, the Y.
Okay, let's see how this plays out. If our data falls on a straight line with a positive slope and the correlation coefficient R is 1, that means R squared is 1. If the correlation coefficient is negative 1, with a line that has a negative slope, an inverse relationship, R squared is also 1. Either way, we're talking about 100%. 100% of the variation in Y is explained by X. There are no other factors. Every single point lies on the line. The variable X does a perfect job of explaining Y, and there's nothing unexplained. There's no need to look further. Obviously, those are unusual cases.
Let's see what happens to R squared with some other typical values of R, the correlation coefficient. If R is 0.30 or negative 0.30, R squared will be 0.09. Now, we look at 0.30 and we say, well, I don't know. It's not zero, but it's not that high either. And then you look at R squared and you say, wow, 9%. That's very little. Only 9% of the variation in Y is explained by X. 91% is unexplained. So R squared makes it much easier to communicate what this relationship, or non-relationship, implies. Let's look at some other numbers.
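The squaring step for the values mentioned so far can be checked with a short Python sketch. The helper name explained_percent is hypothetical, introduced only for this illustration; it simply squares R and expresses the result as a percentage.

```python
def explained_percent(r):
    """Return r squared as the percentage of variation in Y explained by X."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    return round(r * r * 100, 2)

# The values of R discussed so far in the lecture.
for r in (1.0, -1.0, 0.30, -0.30):
    explained = explained_percent(r)
    unexplained = round(100 - explained, 2)
    print(f"r = {r:+.2f}: {explained:.2f}% explained, {unexplained:.2f}% unexplained")
```

Because the sign disappears when you square, a positive and a negative R of the same size always give the same percentage explained.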
Look at an R of 0.50, a correlation coefficient of 0.5 or minus 0.5. R squared is 0.25. It's still very low. 25% of the variation in Y is explained by X. That's not an awful lot. It's not nothing, but it's not an awful lot. On the other hand, if we have a correlation coefficient of 0.9 or negative 0.9, R squared is 0.81. That means that 81% of the variation in Y is explained by X. 19% is unexplained. That's pretty good. And a correlation coefficient of 0.9 or negative 0.9 is indeed considered to be very strong.
Yes, that's a pretty big formula. This is the way we compute R, the correlation coefficient. Do we usually use statistical software like Excel or SPSS? Sure. Is it difficult? Nah. All you need is N, the sample size, which means the number of pairs of data that you have, because your data comes in X-Y pairs. And then you need five summations. You need the sum of the X values, the sum of the Y values, the sum of the products of X times Y, the sum of the X squared, where you square each X value and add those up, and the sum of the Y squared, where you square each Y value and add those up. That's all you need. So the formula looks complicated, but it's not really that bad. And certainly, if you're using any kind of statistical software, it's really, really easy. But you should have some experience; at least try to do this on your own once or twice.
Example one: we're looking at the correlation of grade and height. We have 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. We have 10 students. Each student has a grade on an exam and a height in inches. You ask each student, how tall are you, and those results are in inches. So we're looking to see if there's a relationship between the grade a student gets and the student's height. I have a theory. I don't know if any of you have ever met me in person. If you have, you know that I'm not exactly tall. And I don't really see the problem, because it seems to me that we ought to find out one day.
I think science has been remiss here. My theory is that tall people are not as smart as short people, because it takes a long time for the oxygen to travel up the body to the brain. To me, that just seems obvious. So I decided to test it out. I have this data, grade and height. I got the summations, and we have the scatterplot. Let's take a look at it. We're going to move to the next slide to look at the data and to get the correlation coefficient. But I have to say, just looking at this, I might be mistaken in my theory. I think the data does not bear it out. That looks like a pretty random scatterplot to me. What about you?
If you recall the scatterplot from the previous slide, the pairs that are plotted really look to be pretty random. It doesn't even matter what the correlation coefficient turns out to be. Anyone looking at the scatterplot would say there is no relationship between grade and height. But for practice, let's do it anyway. You can see the calculations for R using that big formula and the summations that were computed. You end up with an R value of 0.1189. If you square that, you get an R squared of about 1.4%. What that means is that the percent of the variation in grades that's explainable by height is less than 1.5%. That's nothing. Almost 100% of the variation in Y is explainable by something else. That was pretty much expected when you looked at the scatterplot.
Of course, eventually you will learn how to test the correlation coefficient R for significance. You're testing it against 0. It may be done in this course, or it may be done in a future course. But you should know that it exists, especially since when you get your output from Excel or SPSS or any other statistical package, there will automatically be a p-value to test for significance.
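The "big formula" is on the slide and is not reproduced in this transcript. The standard computational form, which needs exactly N and the five summations listed earlier, can be sketched in Python as follows. The function name r_from_sums and the five data pairs are hypothetical; they are not the grade and height data from the slide.

```python
import math

def r_from_sums(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Correlation coefficient from n and the five summations.

    r = (n*Sxy - Sx*Sy) / sqrt((n*Sx2 - Sx**2) * (n*Sy2 - Sy**2))
    """
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical paired data, used only to exercise the formula.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)                                   # number of X-Y pairs
sum_x = sum(xs)                               # sum of the X values
sum_y = sum(ys)                               # sum of the Y values
sum_xy = sum(x * y for x, y in zip(xs, ys))   # sum of the products X times Y
sum_x2 = sum(x * x for x in xs)               # sum of each X squared
sum_y2 = sum(y * y for y in ys)               # sum of each Y squared

r = r_from_sums(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2)
r_squared = r * r
print(f"r = {r:.4f}, r squared = {r_squared:.1%}")
```

Working the same numbers with a calculator is a reasonable way to get the by-hand practice the lecture recommends, and then compare against the program.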
So even if this had turned out to be significant, which of course it didn't, since 0.1189 is not significantly different from 0, anything less than 0.3 would not be considered anything we'd want to talk about. It's a very unimportant, useless, and irrelevant relationship, if there is one at all. And of course, if you really want to do a better job, you're going to want to take a sample of more than 10. But either way, even if you have a sample of 10,000 and you end up finding that the correlation coefficient is significant, it wouldn't really impress anybody. Unfortunately, maybe I'll have to revise my theory. Let's move on to the next problem.
Well, we failed at proving the relationship between grades and height. We flopped on that; no relationship. Let's try a different study. We're looking at grades and hours studied. Look at the scatterplot. Right away, just looking at that scatterplot, a researcher would say it looks like there's going to be a strong, positive, linear relationship. And again, to make life easier for you, we give you the sum of x, the sum of y, the sum of the xy, the sum of the x squared, and the sum of the y squared, and we're going to calculate r by hand. Well, look at the r. We get an r of positive 0.97. It can't get higher than 1, so we've done very, very well. We're showing a very strong, positive linear relationship. And the r squared is approximately 94%. If we want to be exact, 94.09%. That means we explained slightly more than 94% of the variation, so approximately 6% is left unexplained. We've done a very good job.
As we've said numerous times, you might want to test for significance, but at least in this slide, we're not going to show you how to do that. You really should test for significance, but chances are it's significant. And we can tell you, if the correlation coefficient is more than 0.8, you can say more than significant: you can say it's strong. Here it's 0.97.
So you can say it's a very strong correlation. You can't use the word significant until you test for significance.
In this problem, we're trying to see the relationship between price and quantity demanded. Okay? At least two of the sums are on the bottom: see the sum of the x, the sum of the y. We give you all the sums. But look at the scatterplot first. Looking at that scatterplot, there's no way in the world this is not going to show a strong negative linear relationship. You can almost draw a line through most of those points, not all of them, but most of them. Okay? So we're going to just calculate r the hard way. Look at your r. You end up with negative 0.99. It can't get less than negative 1. It's almost a perfect negative relationship. That's because most of the points were on the line. I think one or two were off the line. So it's negative 0.99. The r squared is about 98%. That means price explained 98% of the variation in the quantity demanded. Only about 2% is unexplained. You know you're finding a very strong relationship, but an inverse relationship. Okay?
And we tested for significance, but again, you have to trust us that it was significant. So we have a significant correlation. It's negative 0.99. Almost perfect. And we can even use the word strong now, just from eyeballing it, because the y here again is quantity demanded, and if you explain 98% of the variation in something, you've done a great job of explaining. So we can say there's a strong inverse relationship, and a significant one too, between price and quantity demanded.
Here's an interesting problem. It's actually based on a real-life study to see whether there's a relationship between how attractive somebody is and the salary they're given. What do you think? Well, there's a famous study looking at that.
As you probably know, I got a very high salary when I was hired by Brooklyn College, right? I think my colleague is laughing. But how high was your salary when you were hired? Anyway, here's the data. Assume we have a panel of judges; that's what they usually do. They have a panel of judges, and they come up with a rating for each person. We took 10 people randomly. The lowest rating is a 0. The highest rating, which by the way is my rating, was 9. So we have the scores from 0 to 9, and we see how much they started with. These numbers are quite low, but this could have been done a long time ago. These are made-up numbers, by the way.
So we give you the n. Notice n is 10: 10 pairs, 10 people we're looking at. We have the starting salary; that's the y variable, the dependent variable. And we have the sum of the x and the sum of the y. We have everything you need, and now we're going to calculate r. We're going to look at the relationship between attractiveness and salary. We're giving you all the sums that you need and the n; notice the n of 10 is there. We calculate r, and it's positive 0.891, and the r squared is about 79.4%, or 79% if I'm rounding. So we end up with an r of 0.89, which is pretty high. It can't go higher than 1, so we're not far from 1. And we have to test for significance. If you trust us on this, it is significant, so we know that we found a significant relationship between attractiveness and salary, and we have a strong relationship too. It's not just significant, meaning it's not 0; here we can say that it's quite strong, 0.891. We've explained 79.4% of the variation.
As we've told you, there's a way to test for significance. It's a lot easier to have a computer do it for you, and we're going to see when we do regression how Excel does this for us. If you go to our handouts page, you're going to learn how to use Excel and how to do a scatterplot. Everything is there, so you can go there to see how to do this.
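The lecture asks you to take the significance results on trust and does not show the test. One standard approach, offered here only as a sketch and not as the lecturer's method, is a t test with n minus 2 degrees of freedom. The Python below applies it to the two examples whose sample size is stated as 10; the function name t_statistic and the choice of a two-tailed 0.05 level are assumptions.

```python
import math

def t_statistic(r, n):
    """t statistic for testing whether a sample correlation r differs from 0."""
    return r * math.sqrt((n - 2) / (1 - r * r))

# Two-tailed critical value of t at the 0.05 level with n - 2 = 8 degrees of
# freedom, taken from a standard t table (assumption: alpha = 0.05).
T_CRITICAL_DF8 = 2.306

# r values quoted in the lecture; both examples use n = 10.
examples = {
    "grade vs. height": 0.1189,
    "attractiveness vs. salary": 0.891,
}

results = {}
for label, r in examples.items():
    t = t_statistic(r, n=10)
    significant = abs(t) > T_CRITICAL_DF8
    results[label] = (t, significant)
    print(f"{label}: r = {r}, t = {t:.2f}, significant = {significant}")
```

Under these assumptions the height example falls well short of the critical value and the attractiveness example clears it, which agrees with what the lecture states for both.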
So in this problem we're going to use Excel. This is the real world. In the real world, we're not going to expect you to take a pencil and paper and start calculating standard deviations or correlations. It's too much work, and you're bound to get it wrong if you do it by hand. So we're looking at years of education. We have people who had only 10 years of education, meaning they dropped out of high school, and then we have people who went all the way to 20 years of education, and we're trying to relate it to their wage in dollars per hour. Notice the first two people, who only had 10 years of education: one is getting the minimum wage of 15, another one is getting 18. Is there a relationship?
Well, after you type in your data, go to the function wizard, that's the little fx that you'll see in Excel, and you're going to see there's something called CORREL. That's for correlation. You're going to have to put in the arrays where the data is located. So array one is going to be from B4 all the way to B20. You're basically showing where the data is located. Then it's going to ask you to put in array two, and that's C4 to C20. And by the way, there had better be the same amount of data in each one, because the data comes in pairs. If you put in array two as C4 to C22, you're going to get a slap in the face from the computer. Okay, if the computer had a hand, it would slap you. It's got to match: B4 to B20 and C4 to C20. It doesn't matter which is X and which is Y when you do correlation. You can reverse them; it doesn't make a difference. And the computer will give you the correlation coefficient with about a million decimal places.
We cut it off at four: 0.7264. It doesn't tell you whether it's significant or not, unfortunately, but there are ways to do that. We know that the correlation coefficient R is 0.7264, and it's positive. For R squared, we can just square the R by hand, and we get 52.77%. We also asked for the scatterplot, which you can see, and you can see it's kind of a positive linear relationship. So that's the easiest way to get the correlation coefficient: using Excel with that function wizard.
Thank you for attending this lecture. We had a lot of fun with correlation. I'm not even really joking; I think this was a lot of fun. It's interesting to look at the different relationships that you can search for in your variables. Back to what we always say at the end of the lecture: the only thing that will work for you is to do lots and lots and lots of problems. Practice, practice, practice. Find the problems wherever you can, and especially in this topic, use your calculator and also use Excel. You want to get practice using both.