Greetings and salutations. This is Monica Wahee, your library college lecturer, bringing to you chapter 4.1, scatter diagrams and linear correlation. So here's what you're going to learn. At the end of this lecture, you should be able to explain what a scattergram is and how to make one, state what strength and direction mean with respect to correlations, and compute correlation coefficient r using the computational formula. And finally, you should be able to describe why correlation is not necessarily causation. So let's jump right into it. First, we're going to talk about making a scatter diagram. And the thing on the right side of the screen is not a scatter diagram, but it's kind of scattered. So I put it there; it's kind of pretty. And then next, we're going to talk about correlation coefficient r and how to make it. And then finally, we're going to do a shout out to causation and lurking variables, which, remember, we talked about before, but we're going to talk about them again in relationship to r. So let's start with a scattergram. And I also call it a scatter plot, because, like everything in statistics, there's got to be about eight names for everything. So scattergram and scatter plot mean the same thing. So let's just get with the setup here. Scattergrams or scatter plots are graphs of x y pairs. So what's an x y pair? x y pairs are two measurements made of the same individual or the same unit. So if you measure my height and my weight, that's an x y pair. If you measure my height and then my friend's weight, that's not an x y pair, because that's two different people, right? So in these x y pairs, the x part is called the explanatory or independent variable. And it's always graphed on the x axis. So remember in algebra, you would do these graphs where you had this vertical line, and that was the y axis, and this horizontal line, which was the x axis. I always had trouble remembering which was which, but that's how it is.
And so whichever of the pairs is x, expect that to be graphed along the x axis. And it's also called the explanatory or independent variable; remember, there's got to be a million names for everything. So if I talked to you and said, here's an x y pair, and this one is the independent variable, or this one is the explanatory variable, you need to just secretly know I'm talking about the x of the two. And then, surprise, here's the y of the two. The y is also called the response variable. It's also called the dependent variable. And that is graphed on the y axis. So again, like I said, I used to have trouble remembering whether the vertical one is the y axis or the horizontal one. But what I did was I remembered, if you take a capital Y, and you go grab onto its tail, and you pull it straight down, you'll see that it's vertical. And that's how I remember that's the y axis. It doesn't hurt the Y; it's used to that. So if you can stretch the Y's tail down and you get vertical, remember, that's the y axis, and then the other one is the x axis. Okay, and then you also have to find a way to remember which one means what. Like, does x mean explanatory and independent, or does it mean response and dependent? So how I do it is, you know how we sing the ABCs, A B C D E F G? Well, if you fast forward, the end is W X Y Z, right? So the x comes before the y in the alphabet. So I do x, and then an arrow to y. And then I imagine in my head that's saying x causes y, even though x doesn't necessarily cause y, as you'll see at the end of this lecture. But I think about it that way. Because if that happens, then y is dependent on x, and x is independent; it can do whatever it wants. But y is dependent. So that's my way of remembering x is the independent variable and y is the dependent variable. So anyway, that's a long way of saying the scattergram is a graph of these x y pairs.
And that's what we're going to do: make that graph. So we need some x y pairs, right? So I asked the question, does the number of diagnoses a patient has correlate with the number of medications she or he takes? If you don't have that many diagnoses, you probably aren't on that many meds, right? But if you have a lot of diagnoses, you should be on a lot of meds. But we all know people in real life can sort of violate that. I mean, you could have one really bad diagnosis with a lot of meds, or you could have a bunch of diagnoses that are all taken care of with one med. So it's not perfect. But this is kind of a reasonable thing to think. So what I did was I put up here just four x y pairs, as you can see, so I've got four pretend patients. And you can see here's the first patient. That person has an x of one, because they only have one diagnosis, but like I was saying, it must be a bad diagnosis, because that person has a y of three; they're on three meds for it, right? So that's how you read this table. So let's start making our scattergram out of these data. Okay, so here we go. I labeled the x axis number of diagnoses, right, just to keep things straight, and the y axis number of medications. And then you'll see where I put the dot. Because x is one, I went over to one diagnosis. And then, because y was three, I went up three, and there goes the dot. That's where that first person gets a dot. Okay, you put it there. And that's what you're going to do with these other ones too: put dots. Okay, I just threw all the dots down so you can kind of see what was going on. But here's the second person, right? That person had an x of three, so I went over three. And I put those green arrows in just so you could see what was going on. They're really not part of the scatter plot; it's more like cheating, you know, to show you, because we're just practicing, right?
So that person had an x of three and then a y of five, and you see where the dot goes, right? And then here, you can see where the third dot goes, because there's a four and a four. And then here we have the fourth dot. So this is the scattergram of these four patients. Of course, a lot of times you have hundreds of patients in there, but I just showed you the simple example. Okay, now, because we did that, I can talk about linear correlation, and you'll kind of get it. Linear correlation, that term means that when you make a scatter plot of x y pairs, it kind of looks like a line. Now, what's over here on the right is not like biology, that's not like statistics, that's like algebra, right? Because back in algebra, you'd have these perfect lines where the dot was right on the line. And see the x and y? Notice there's no diagnoses, nothing. That's algebra, right? So perfect linear correlation looks like graphing points in algebra. And if you actually make a scatter plot of people's x y pairs and you see that, you should suspect there's something wrong. It actually happened to me once. One of our statisticians came to me and said, Monica, look at this, you won't believe this. I said, whoa, I don't believe this. What are you graphing? And he said on the x axis he had put the weight of each person's liver, and on the y axis he put the weight of the whole person. And I'm like, how do you weigh people's livers? That sounds painful. And he goes, oh, let me go see. And what he learned was that you don't weigh people's livers; you use an equation to estimate the weight of their liver, and guess what's in the equation: their actual weight. So I'm like, that's why it came out on a line, because you were using the y to calculate the x. And he was like, oh, you're so smart for a secretary. So then I became an epidemiologist.
But anyway, if you ever see this in biology, just suspect something fishy, because really, things just don't end up right on the line. But if they get really close, you can say it's close to perfect linear correlation. I just wanted to let you know that's what's going on here with this linear correlation. Okay, so let's talk about facts about linear correlation. Things can be linearly correlated without being perfectly on the line. Obviously, our little example was. So when you make those dots in your scattergram, imagine a line going through them. If it kind of looks like the line is going up, this is called a positive correlation. But you don't always have a line going up. So I want you to look at this, and I made up these data too. On the x axis is the number of patient complaints. So as we go along, the patients are madder and madder, they're grouchier and grouchier, making more complaints. On the y axis, we have the number of nurses staffed on the shift, right? And so as you go up, there's more nurses. Well, sure enough, when you've got a lot of nurses, you don't have as many patient complaints, right, because they're being attended to. So this is what some people would call an inverse correlation. But in this presentation, I'm calling it a negative correlation, because as one goes up, the other goes down, and as one goes down, the other goes up. And that's depicted visually with this line going down. So you see, if you can imagine a line going down, that's a negative correlation. Neither is better, you know, positive versus negative; it just explains how these things are behaving together, how x and y behave together. But then you can have situations where there's really no correlation, where x and y really don't have anything to do with each other. So as you've seen, when you have patients in the hospital, some of them have really big families, and those families come a lot.
And some of them don't really have that many loved ones. So as you can see along x, here are total unique visitors, meaning you just count each person once. So there's a patient who only has one unique visitor. But if you look at y, days spent in the hospital, that person has been there seven days, and that visitor keeps coming, right? And then you have maybe a patient here, the second one, with two unique visitors. And that person's only been in one day, but both those people have been there. Then you have a person with three unique visitors, and they've been in the hospital for days, right? And those are probably the same three people coming back. So it really doesn't matter how long a person's in the hospital; if they've got a lot of loved ones who keep coming, they'll keep coming or not, right, according to this correlation. So you end up imagining a straight line. And that's no correlation. That's fine too; nothing is better or worse. It's just that you make the scattergram to try and understand how x and y are related. This is always fun. In books, they always make some sort of goofy picture. I don't know why they do this; I would never get a goofy picture like they show in books. I made up this correlation: this is the number of games in the lobby and the number of books in the lobby, and they should really have nothing to do with each other. But if you see something just way goofy like this, just say it's no correlation. I don't even know how I'd get this. Hi there. All right. So we've been talking about correlation, and it actually has two attributes. So far, we've only talked about one, and that is direction. We talked about positive, negative, and no correlation. So whenever you're talking about a correlation, you have to say what direction it is. But you also have to say the other thing, which is what strength it is.
So now we're going to talk about how you figure out what the strength is. Strength refers to how close to the line all of the dots fall. If they fall really close to the line, it is considered strong. If they fall kind of close to the line, it's called moderate. And if they aren't very close to the line, it's weak. Now remember, that's totally different from what direction it is. It could be positive strong or negative strong, right? It could be positive moderate or negative moderate. So the strength is a statement of how close the dots you make in your scattergram fall to the line that you end up drawing. So I thought I'd just give you a few examples. Look at this, I just made this up. This is what a strong negative one would look like. Notice how those pink dots are almost on the line. And this is a strong positive. Again, even one of the dots is on the line, right? Not all of them, you know, or it'd be perfect, but it's never perfect. So this is really close, and it's strong positive. So strong just refers to the fact that the dots are almost on the line. Now this is almost the same correlation, but the dots are not really almost on the line. The line has to be fair and kind of go in between them, but they're kind of far away. And so just eyeballing it, you would say this is moderate. And here it gets weak. And mainly it's because the dots are more all over the place. But you'll notice there's one that's right on the x axis. And then hey, look up there, in the title, there's one way up there. And that's like an outlier. And sometimes when you get outliers, they can really whack things out. So even though this is a weak correlation, that line looks so powerful, because it's basically connecting these two outliers. So you just have to be careful.
And that's part of why you make a scattergram first: outliers can have a really powerful effect on the correlation, especially if they're in any of the four corners of the plot. If you get a weird outlier kind of in the middle, it's not going to do as much as if it's in the upper right, upper left, lower right, or lower left, where it can really affect the direction. You know, it's like a seesaw or a teeter-totter; an outlier can get on and really change the direction of it. And it can also mess with how strong or weak the correlation is. So that's why you really want to start with the scatter plot. And that's why this chapter is organized to start with the scatter plot: you want to look for outliers and also just see how x and y look when you plot them. Now we're going to get on to correlation coefficient r. We're going to get on to computation and actually making a number, so you don't just use watery terms like positive or negative for direction, or moderate, strong, and weak for strength, but can actually put a number on how correlated x and y are. So remember the word coefficient; we did it with coefficient of variation, which is different. The CV is one kind of coefficient. But what we're going to talk about is a different kind this time. Our coefficient this time is called r. And coefficient just means a number; it's a word we like to use in statistics. Now it seems kind of weird, because I'm talking about correlation, and people are like, well, why is it r? Why isn't it c for correlation? And I'm like, I don't know, I didn't invent it. But this is how you can remember it. You can go: correlation, co-R-relation. So correlation coefficient r; just remember r means correlation.
And technically, r means the sample correlation coefficient. For the population correlation coefficient, like, imagine you're correlating height and weight in the population, everybody in a particular state, you actually need a Greek letter. And I showed it on the screen; it's this fancy P, I don't know the right name of it. But we don't actually cover it in this class. I just wanted to show it to you. We're only going to focus on r, which is the sample correlation coefficient. So what is r? Well, like I said, it's the numerical quantification of how correlated a set of x y pairs are. And it's actually calculated by plugging all of the x y pairs into the equation. I'll show you how to do it, and you can see that if you do it by hand and you have a lot of x y pairs, it'll take forever. So I tried to limit that. And remember, with standard deviation and variance, there was a defining formula and a computational formula. This time, I'm only going to show you the computational formula. It's, in my opinion, way easier to do, and it gets you the same number. Alright, so that's what we're going to do: we're going to take a set of x y pairs, and we're going to calculate r. But then how do you interpret it? Well, let me just prepare you mentally for what we're going to get out of this calculation. The r calculation produces a number, and the lowest number possible is negative 1.0. That's perfect negative correlation. So if we were in algebra, and we had a line going down, and all the dots were on it, then the r would be negative 1.0. But that never happens, right? So the way to think about it is, if you have a negative correlation, and you get an r that's like negative 0.95 or something really close to negative 1.0, then it's close to perfect negative correlation. That's how you want to think about it.
And then the opposite is that the highest possible number you can get for r is 1.0. But most people never get that, except for that one mistake I was telling you about. And that would be perfect positive correlation. So if you calculate an r and it gets really close, like 0.95, like I said, or 0.98 or whatever, then you're thinking, whoa, this is really close to perfect positive correlation, right? And then everything else is in between. So, you know, 0.5 or negative 0.3 or 0.02 or negative 0.09, all of those are between negative 1.0 and 1.0. And that's where r should be. So let's say you calculate r and you get eight. Okay, you did it wrong, right? Or you calculate r and you get negative 2.3; that's not right. It's got to be between negative 1.0 and 1.0. And if you make a scattergram, you should know whether it should be on the negative side or the positive side, or at least it should give you a hint. So this is just to calibrate what to expect from r, because it's kind of a big calculation. So I'm going to give you some pictorial examples, because remember, every single time we make r, we also have a scatter plot behind it. And I just thought it would be helpful to see some real life examples of r. These are real life examples. Okay, real life. You don't get this from just anything, right? I'm just teasing. But anyway, I started with some negative r's because I'm feeling negative today. I went into the literature and I found this article, oh, it's from MIT and Harvard. It's about the evolutionary principles of modular gene regulation in yeast. And all I know is I'm supposed to cut down on eating bread. So that's all I know about this. But they had these really nice scatter plots, and they calculated r for them, and they had a little line on them. So I thought I'd show them to you. So if you look at the one that's labeled D, see where the dots are, right, and see where the line is.
And this looks kind of like a moderate to strong negative correlation, right? Because the dots are kind of close to the line. And then when the group calculated r, they got negative 0.7. And that kind of makes sense. I put my opinion in the lower right; these aren't official cut points or anything, but I usually use these as a guide. See how I said negative 0.4 to negative 0.7 is moderate? So I would have called that D one moderate. Now let's look at E. See how the dots don't cluster as close to the line as they do with the D one? That's going to make it a weaker correlation. It's still negative, right? So it's negative 0.4. And when you look at my little opinion, I still call that moderate, but it's on the low end. See that? And then if you look at F, see how many of them are way far away from that line, and they're dragging it down. So now it's an even weaker correlation, negative 0.25, right? And so that's weak. So these are just some examples to give you a pictorial. And now I promise to be more positive. Here's some positive r's. They didn't draw a line on this one. This is a different article, right? Obesity is associated with macrophage accumulation in adipose tissue. So again, try to cut down on bread. But anyway, if you look on the left side, you'll see all of these x y pairs plotted on the scattergram. And even though we don't have a line there, we can imagine it's going up. So we would expect this to be positive. But we also would imagine they're not really clustering around the line very tightly. So when we see that the r is 0.6, we're not surprised. I mean, it's on the high side of moderate in my world, which makes sense. But go look at the right one, you know, the B one. Look at how you could almost connect the dots and get a line out of that. So that's really tightly hugging the line.
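The rule-of-thumb guide described above can be written down as a small function. This is only a sketch of the lecturer's informal cut points (0.4 to 0.7 in absolute value counted as moderate, anything smaller as weak, anything larger as strong). The function name `describe_r` and the handling of exactly zero are assumptions made for illustration, and these are not official thresholds.

```python
def describe_r(r):
    """Label a correlation coefficient with a direction and a rough strength.

    The cut points follow the informal guide from the lecture and are
    an opinion, not a standard: |r| < 0.4 weak, 0.4 to 0.7 moderate,
    above 0.7 strong.
    """
    if not -1.0 <= r <= 1.0:
        # r can never fall outside this range; a value like 8 or -2.3
        # means the calculation went wrong somewhere.
        raise ValueError("r must be between -1.0 and 1.0")
    if r == 0:
        return "no linear correlation"  # assumption: exactly zero gets its own label

    direction = "positive" if r > 0 else "negative"
    size = abs(r)
    if size < 0.4:
        strength = "weak"
    elif size <= 0.7:
        strength = "moderate"
    else:
        strength = "strong"
    return f"{strength} {direction}"


# The r values read off the example scatter plots in the lecture.
for value in (-0.7, -0.4, -0.25, 0.6):
    print(value, describe_r(value))
```

Direction and strength are kept as two separate pieces here because they are two separate attributes: the sign gives the direction, and the distance from zero gives the strength.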
And then we're not surprised to see that the r is 0.92. So that's pretty strong. So I just wanted to give you these pictorials before we actually went forth and calculated r, because that's one thing you can do: do the scatter plot and have an expectation of what r should look like. And then if you calculate r and it's totally wacky, you know that you did something wrong. Okay, let's calculate r, and let's use the computational formula. I threw the formula up in the upper left, and don't feel overwhelmed by it; we're going to take that apart very carefully, right? But before we even do that, I just want you to have a flashback to chapter 3.2. See all those sums, those capital sigmas, in the equation? We're going to handle calculating r a lot like we handled calculating variance and standard deviation. We're going to make a table with columns, and then we're going to fill in those columns with calculations. And then we're going to add up the columns to get all those numbers. So you were already good at that in 3.2; you'll be good at this too. And then I made up a story, because in statistics it's a lot easier to check your work if there's some story behind it. So pretend we have seven patients that have been going to your clinic for a year. They're good patients, they keep coming. So they came to the clinic over the year. And at the last visit of the year, you measured their diastolic blood pressure. And what you thought would make sense is that those with a higher diastolic blood pressure would have had more appointments over the year, because probably they're trying to stabilize the blood pressure, or maybe they have other problems that are driving it up. This makes perfect sense, right? So what you wanted to do is see if you were right.
So you're going to take the diastolic blood pressure at the last appointment as your x, you know, because you think that's maybe the explanatory variable, the independent variable that might have something to do with whether or not they had a lot of appointments. And then you take y as the number of appointments over the last year, because you'd say, okay, high DBP probably means they have more appointments. That's just your idea. Maybe you're wrong, but we're going to do that. Okay. So I put in the title, just as a reminder, x is DBP and y is number of appointments, so you don't forget. And then we made up this table. So look in the first column: it's just the patient number. It's nothing exciting; we just want to keep track of which patient is which, right? And then notice under x, we just have all of their DBPs. So patient one at the last appointment had 70 mmHg, and patient two had 115 mmHg. That's kind of alarming, but these are fake data, so don't get worried about these patients. But anyway, we just fill in x, and then also, when you had their chart out, you could look up how many appointments they had over the last year. Patient one only had three, whereas patient two had 45, which you can believe, because sometimes they're coming in all the time to get stuff adjusted. But then, you know, patient three only had 21, and patient four had seven. So you can see these are the x y pairs for each of these patients, right? And it's pretty simple to go to the bottom and sum up each of the columns: the sum of x is 678 and the sum of y is 166. And also, I'm reminding you of the r calculation; I put that in the upper right, just so we can see what we're doing. I just want to call your attention to one of the terms in there, which is the sum of x, which I put in the parentheses here. And that we already know now, just from making the first part of this table and adding it up.
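For reference, the computational formula being pointed to on the slide is usually written as below. The slide itself is not reproduced in this transcript, so this is the standard textbook form, reconstructed from the terms the lecture names: n, the sum of x, the sum of y, the sum of xy, the sum of x squared, and the sum of y squared.

```latex
r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}
         {\sqrt{n\sum x^{2} - \left(\sum x\right)^{2}}\;
          \sqrt{n\sum y^{2} - \left(\sum y\right)^{2}}}
```

Note the two look-alike terms in each square root: $\sum x^{2}$ (square each value, then add) and $\left(\sum x\right)^{2}$ (add the values, then square the total).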
So we already have that thing in there. Now, I just wanted to point out, if you saw the sum of x over here, it's not exactly the sum of x; it's the sum of x y. The y is mushed right next to it. That's not sum of x, that's sum of x y. And later in the game, we're going to put the sum of x y at the bottom of the last column. So that first term there, that's not sum of x, that's sum of x y. Okay, now downstairs, we see the sum of x to the second, right? And that looks an awful lot like the one next to it on the left, which also says sum of x to the second, right? So how do you tell the difference between the kind without the parentheses and the kind with the parentheses? This is how I do it. The rule is always, regardless of what's going on, do what's in the parentheses first. So that's easy to do if you have parentheses. If you've got the parentheses version, you know that the sum of x to the second with the parentheses in it means you just do the sum of x, and you do the sum of x again, and you times them by each other, right? But what if you don't have any parentheses? Well, what I do is I say, well, if I did have some, I'd do it this way. But if I don't have any, then I know I have to do the sum of the x squared column, right? So that's where you take x times x, x times x, x times x on each line, put it there, and sum that up. So that's how I go through it, no matter where I am in statistics or algebra. If I see that sum symbol, and then the x squared, I first look for the parentheses. If they're there, I know what to do. If they're not there, then I know you don't do the thing where you just square the sum of x; you have to go and look at the bottom of the x to the second column and take the sum of that. I hope this is helpful. Alright, so as you can see, I've shown you that on the top of the equation is where you just take the sum of x and the sum of y. And on the bottom, I'm showing you where you take those and you take the square of them.
And then the other term is the one where you just take the sum of the column. Alright. And so there you go. So what happened here? Well, we filled in x to the second. So if you go to patient one, 70 times 70 is 4,900. That's where we're getting that number. So you go through, and then for patient two, 115 times 115 is 13,225. So you go fill all those in, and then you sum those up. And that's what goes in that first term over there. And then I'll bet you can guess what the next slide is. Surprise. Now we do the y one. So don't get confused, because you kind of have to skip a column there. Three times three is nine, and 45 times 45 is 2,025. That's how we're filling in the y squared column. You sum all that up, and then go look up at the equation. That's where you put that sum of y squared. Now we have x y, and this reminds me of a student I had before. She was really confused. She's like, Monica, I don't know what to do with x y, the x y column. And I go, what do you mean? I mean, it's pretty obvious you just take x times y; like here, 70 times three is 210. She goes, x times y, where's the times? Like, how do you know it's supposed to be times? I don't see any times. And I'm like, right, I don't see any times either. Like, how do you know to do that? Well, anyway, I'll just tell you: imagine a little multiplication symbol between x and y. That's what's supposed to be there. That's what you're supposed to imagine. I guess I was so used to looking at it. I was like, you're right, I don't know, I guess you're just supposed to assume that. So take x times y. For patient two, we just took 115 times 45, and that's how we got 5,175. So you go through each of those. This is a lot of processing. And then you sum it up at the bottom. That's a big number. And then you see I circled it in the r equation. So I think we figured out where to put everything.
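The column-filling steps above can be sketched in a few lines of Python. Only the two patients whose values are spelled out in the lecture are included (patient one with x = 70, y = 3 and patient two with x = 115, y = 45), so the totals printed at the end cover those two rows only and are not the seven-patient column sums. The variable and key names are made up for this sketch.

```python
# Rows for the two patients whose numbers are stated in the lecture.
# x = diastolic blood pressure (mmHg), y = number of appointments.
rows = [
    {"patient": 1, "x": 70, "y": 3},
    {"patient": 2, "x": 115, "y": 45},
]

# Fill in the three calculated columns, one row at a time.
for row in rows:
    row["x_sq"] = row["x"] * row["x"]   # x squared column
    row["y_sq"] = row["y"] * row["y"]   # y squared column
    row["xy"] = row["x"] * row["y"]     # x y column: x TIMES y
    print(row)

# The parentheses rule, shown on these two rows only:
sum_of_x_squared = sum(row["x_sq"] for row in rows)   # no parentheses: sum the x squared column
square_of_sum_x = sum(row["x"] for row in rows) ** 2  # parentheses: sum x first, then square

print("sum of x squared column (2 rows):", sum_of_x_squared)
print("(sum of x) squared (2 rows):", square_of_sum_x)
```

The last two printed numbers differ, which is the whole point of checking for parentheses before deciding which operation to do first.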
Obviously, n is seven, right, because we have seven patients. You see a bunch of n's in there. So I think we have all our ingredients. So let's move forward. All I did here was rewrite the exact same equation with all the ingredients in it. Right. So like I said, the n is seven, and so wherever you see n, you'll see a seven. See that sum of x y on the top? You see where that goes. See the sum of x and the sum of y, and then downstairs, you'll see I filled in all those numbers too. Now let me just talk to you a little bit about both levels, the numerator and the denominator. In the numerator, because we have order of operations, you need to do out the n times the sum of x y, that's seven times 18,458; you need to do that out first. And then you need to do the other one, you know, the 678 times 166. And then after you're done with those two things, you have to subtract the second one from the first one. That's the order you have to do it in to get the numerator right. Now for the denominator, it's a little bit the same, but a little more complicated. You see on the left side, you have that seven times 67,892; you have to do that out. And then you have 678 squared; you have to do that out. Then you have to subtract that from the first one. And after you have that, you take the square root of all of that. And that's your first term. And then you still have to go over to the other one. You have to take seven times 6,768, keep that, then take 166 times 166, keep that, then subtract that term from the first one. And after you're done with all that, you take the square root of that. And then those two things you have to multiply together. So that's a lot of work, and you have to do it in the right order. So here, I just wanted you to see how you probably want to work out this term separately first, and then work out this other term separately.
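The order of operations described above can be checked with a short script. This is a sketch that uses only the column totals quoted in the lecture (n = 7, sum of x = 678, sum of y = 166, sum of x y = 18,458, sum of x squared = 67,892, sum of y squared = 6,768). The variable names are made up for illustration.

```python
import math

# Column totals quoted in the lecture for the seven patients.
n = 7
sum_x = 678
sum_y = 166
sum_xy = 18458
sum_x_sq = 67892   # bottom of the x squared column (no parentheses)
sum_y_sq = 6768    # bottom of the y squared column (no parentheses)

# Numerator: do each product first, then subtract the second from the first.
numerator = n * sum_xy - sum_x * sum_y

# Denominator: work out each term under its own square root separately.
left_term = n * sum_x_sq - sum_x ** 2    # (sum of x) squared uses the parentheses rule
right_term = n * sum_y_sq - sum_y ** 2
denominator = math.sqrt(left_term) * math.sqrt(right_term)

r = numerator / denominator

print("numerator:", numerator)                  # 16658
print("left term before the root:", left_term)  # 15560
print("right term before the root:", right_term)  # 19820
print("denominator:", round(denominator, 1))
print("r:", round(r, 3))
```

Keeping the numerator, the two denominator terms, and the final division as separate named steps mirrors the hand calculation, so each intermediate number can be compared against your own work before moving on.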
And just like that thing I was telling you about x y, those two terms get multiplied. Once you work them out, you take the square root of the left one and the square root of the right one, and you have to multiply them together to get the denominator. So this slide is to help you see. I threw the numerator on; that was relatively easy. But these are the two different numbers you should get from the left side of the denominator and the right side of the denominator, just to check your work. And then of course, once you multiply them by each other, you get this number, 17,561.3. So ultimately, what the calculation for r comes down to is you're trying to calculate the numerator, and you're trying to calculate the denominator. And at the end, you divide the numerator by the denominator, and you get the answer, which is r. So we're going to do that now. And here's what we got: we got 0.949. And because we see that it's positive, we know it's a positive correlation. And then remember my opinion, and also probably everyone's opinion, because if you round that up, you get 0.95. Well, that's getting really close to 1.0. So most people would agree that that's pretty strong. So how you would diagnose this correlation is you would say it's positive and it's strong. Okay, I just want to wrap this up by giving you a few facts about r that I may not have covered yet. First, r requires data with a bivariate normal distribution, which is something we didn't check before doing our r in this class, because I just don't cover that. But please know that if you take another statistics class and they bring up r, they might talk about checking for the bivariate normal distribution. So just know about that. Next, please know that r does not have any units. Remember, the coefficient of variation didn't have any units either; some things just don't have units, and r is one of them. Also, we did talk about how perfect linear correlation is where r equals negative 1.0.
That's if it's a negative correlation, or r equals 1.0, which is a positive correlation. But I might not have mentioned that no linear correlation is r equals zero. Now you probably won't see that in real life. But sometimes I'll make an r and the r is either positive or negative, but it's 0.000000 something, right? Regardless of whether it's positive or negative, if it's 0.000000 something, it's really close to zero. So that means there's probably no linear correlation. And then we learned about positive r and negative r. But I just wanted to remind you of the behavior of x and y when you get those circumstances. Okay, so if you have a positive r, it means as x goes up, y goes up. But it also means as x goes down, y goes down. So they travel together. When you get a negative r, it means as x goes up, y goes down. But it also means the opposite, as x goes down, y goes up, so they travel in opposite directions. Now, here's another fact about r, a little factoid. If you choose to switch the axes, like let's say you give me xy pairs, and I designate a certain variable as x and a certain one as y, and you actually designate them the opposite, it really doesn't matter, even in the equation, because you'll end up with the same r value. So it doesn't matter if you call my x y, and I call your, you know, y x. We can switch them, but we'll still end up with the same r from the calculation. Then finally, even if you converted x and y to different units, you'd get the same r. So let's say that you were in England, and you were doing the correlation between height and weight, and you were using the metric system on the same patients that I was measuring with the US system. Even though we'd have different numbers, because obviously you have to convert them, we'd still get the same r when we were done.
So finally, we get to the last subject of this lecture, which is lurking variables, which you've heard about before.
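Before getting into lurking variables, the last two facts about r (switching which variable is x, and converting units) can be checked with a short sketch. The heights and weights below are hypothetical numbers invented only for this illustration, not the patients from the lecture, and `pearson_r` is an illustrative helper name for the computational formula.

```python
import math


def pearson_r(xs, ys):
    """Correlation coefficient r via the computational formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    left_term = math.sqrt(n * sum_x2 - sum_x ** 2)
    right_term = math.sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / (left_term * right_term)


# Hypothetical heights (inches) and weights (pounds), invented for this sketch.
height_in = [60, 62, 65, 67, 70, 72, 75]
weight_lb = [115, 130, 128, 150, 165, 160, 190]

# The same made-up patients measured in metric units.
height_cm = [h * 2.54 for h in height_in]
weight_kg = [w * 0.45359237 for w in weight_lb]

r_original = pearson_r(height_in, weight_lb)
r_swapped = pearson_r(weight_lb, height_in)  # x and y switched
r_metric = pearson_r(height_cm, weight_kg)   # units converted

print(math.isclose(r_original, r_swapped))
print(math.isclose(r_original, r_metric))
```

Both checks should print `True`. The comparison uses `math.isclose` rather than `==` because the metric values are floating-point numbers, so tiny rounding differences are expected.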
But the main point I want to make is correlation is not causation. So you don't want to be misled by correlations. So beware of lurking variables. So remember, lurking variables are things lurking behind the scenes that cause things, right? And so you may have realized that selecting x and y, like if you have xy pairs, designating which one is x and which one is y is kind of political, because you're implying that x could cause y. So let's say that you're correlating height and weight. Taller people are heavier. So you would call x the height and y the weight. You know, people don't go, oh, I'm too short, I should gain weight so I can grow taller. You know, that's just not the way things work. So you have to put x as the height and y as the weight. But there are in reality other causes of weight besides height. In fact, there are things that cause both height and weight, like genetics, right? So a genetic profile that leads to tallness and also obesity could be a lurking variable in the relationship between height and weight. So there could be some tall people that are always obese, and it's not really just because they're tall. It could be because they have the genetics that program them to be tall and also obese, right? And so here's an example where you've got to be real careful with correlation. So there's been this claim that eating ice cream causes murders, because they noticed that in areas where ice cream sales go up, murder rates rise. And I don't know about you, but when I have some really good ice cream, it just makes me so mad. I'm just kidding. I mean, why would this happen? Right? Well, the reality is summer and warm weather are lurking variables, because you sell more ice cream in the summer. You know, the ice cream consumption goes up. But also people are outside more, and more murders occur. And you know, I'm from Minnesota, where it gets really cold for periods of the winter.
And oh my gosh, there are totally no murders then. Like, people just don't commit murders when it's really frigid out, it's just really inconvenient. So that's a situation where there's a lurking variable. And so you don't want to start, you know, screwing up our ice cream laws and making it so we can't have ice cream, just because you mistakenly conclude that ice cream causes murders, right? There's a lurking variable behind it that's having something to do with both.
Here's another one. My professor in my biostatistics class used this one. He put up a time series chart over a long time, like since the 1900s. And he pointed out that as people purchase more onions over time, as onion consumption goes up and down, the stock market rises and falls with it, right? So when the stock market's low, people aren't eating as many onions. And this is just true over generations in the US. So, um, yeah, we've had some problems with our economy in the US. Do you think we should all start eating a bunch of onions? Right. So the healthy economy is a lurking variable. In a healthy economy, people buy more food, including onions, and a healthy economy also boosts the stock market. So you've got to be careful about this. Correlation is not causation, you know, and so if you want to make the stock market go up, don't make everybody eat onions. And definitely don't make us stop eating ice cream, that would make me very upset. So at the end of the day, you're not going to be able to affect the murder rate by bringing down the ice cream consumption rate, and you're not going to be able to fix the stock market by making people eat onions. And so that's the whole concept behind lurking variables, and correlation is not necessarily causation.
So in conclusion, when you're doing your correlations, first make a scattergram, because you want to get a visual idea of the strength and the direction. And you also want to look for outliers.
Then go on and calculate r by hand, but be really careful, because it's a big hairy calculation, and you don't want to make any mistakes. And then finally, when you go to interpret r, be careful of lurking variables. And remember that correlation is not necessarily causation. And now time for some ice cream.
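For anyone who wants to see the ice cream example in numbers, here is a minimal simulation sketch. Everything in it is made up for illustration: the temperatures, the multipliers, the noise levels, and the names `ice_cream_sales` and `murder_rate` are assumptions, not real data. The point is only that two variables that each depend on a lurking variable (temperature) can show a strong r even though neither one appears in the formula for the other.

```python
import math
import random


def pearson_r(xs, ys):
    """Correlation coefficient r via the computational formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    left_term = math.sqrt(n * sum_x2 - sum_x ** 2)
    right_term = math.sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / (left_term * right_term)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Simulated daily temperatures: this is the lurking variable.
temperature = [rng.uniform(0, 30) for _ in range(200)]

# Each outcome depends only on temperature plus its own random noise.
# Ice cream never enters the murder formula, and vice versa.
ice_cream_sales = [2.0 * t + rng.gauss(0, 5) for t in temperature]
murder_rate = [0.5 * t + rng.gauss(0, 2) for t in temperature]

r = pearson_r(ice_cream_sales, murder_rate)
print("r between ice cream sales and murder rate:", round(r, 2))
```

With these made-up settings, r should come out strongly positive, even though the code contains no causal link between ice cream and murders. That is the lurking variable at work.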