Statistics and Excel. Correlation and Regression. Got data? Let's get stuck into it with Statistics and Excel. First, a word from our sponsor. Yeah, actually we're sponsoring ourselves on this one, because apparently the merchandisers don't want to be seen with us. But that's okay, whatever, because our merchandise is better than their stupid stuff anyway. Like our "Trust me, I'm an accountant" product line. Yeah, it's paramount that you let people know that you're an accountant, because apparently we're among the only ones equipped with the number-crunching skills to answer society's current deep, complex and nuanced questions. If you would like a commercial-free experience, consider subscribing to our website at accountinginstruction.com or accountinginstruction.thinkific.com. First question: what is correlation? Correlation measures the strength and direction of the linear relationship between two variables. And whenever we think about correlation, we have to keep in mind the common phrase that correlation does not necessarily mean causation. However, when a phrase becomes as common as this one, it often loses some meaning. People often say it as a kind of mantra without really thinking deeply about what the mantra originally meant. So we'll come back to this phrase, possibly multiple times, through the presentation. But first, a quick recap of what we have done in prior sections. We have tried to describe different data sets using both mathematical descriptions, such as calculations like the mean or average, the median, the mode, and the quartiles, and using pictorial representations like a box-and-whiskers plot or a histogram.
Remember that the histogram is one of our primary tools to visualize the spread of the data, and we're able to use descriptive terms about the histogram to describe that spread, such as the data is skewed to the left or the data is skewed to the right. We then thought about certain types of data sets that might approximate a mathematical type of calculation of a line or a curve, like a uniform distribution, a Poisson distribution, an exponential distribution, and a bell curve or normal type of distribution. And if we can describe the data with a line that has a function behind it, that gives us more predictive power. So now we're thinking about two data sets, or possibly more than two data sets in some circumstances, but we're starting off with two data sets to see if there's some kind of relationship between them. First, we think about that relationship as a mathematical relationship. In other words, are the dots in the different data sets moving together in some way, shape or form? And this is where we get into the differences between correlation and causation. And the reason this is important is because, with that phrase, correlation does not necessarily equal causation, people might react to it in ways that aren't exactly correct. One reaction might be: well, that's so common that I'm just going to dismiss it. I'm not even going to think about it, and I'm going to revert back to what is natural for us to do, which is, if there's a correlated set of data points, we as human beings are hardwired to think that there's a cause-and-effect relationship. That's why the phrase came up in the first place: to say, hey, no, sometimes if you just comb through data, you're going to find correlations that have a mathematical relationship but aren't cause-and-effect related.
And if you know that, those things can be kind of funny, because you go, hey look, this is mathematically related and it obviously has no cause-and-effect type of relationship. But the other conclusion that people might come to is to say that the correlation is meaningless because correlation does not equal causation. So what's the point of doing the calculation if correlation doesn't equal causation? Because what we're usually trying to do is find a cause-and-effect type of relation with the mathematical calculation of a correlation. So note that if there is a correlation, it doesn't necessarily mean that there's a cause-and-effect relationship. However, if there is a cause-and-effect relationship, you would think we would be able to find a correlation. In other words, the correlation is still important because it might lead us to the question: is there a cause-and-effect relationship? It's more likely that there is a cause-and-effect relationship if there is a correlation, but we have to be careful, because there's not always a cause-and-effect relationship. So the general idea is: when we're thinking about correlation, we're looking at different data sets to see if there's a mathematical relation between them. If there is a mathematical relation, or correlation, between the data sets, the next logical question is: is there a cause-and-effect relationship between the data sets? And if we determine that there is a cause-and-effect relationship, which is causing the correlation or mathematical relation, the next logical question is: what's the causal factor in that cause-and-effect relation? Okay, so types of correlations. We have the positive correlation, where both variables increase together. In this example, we have hens and we have eggs. So these are our two different data sets.
We have how many hens we have and how many eggs they are producing. I think this is per year, for example. So in this case, when we plot this out, we have on the x-axis the number of hens. At three hens, we had around 100 eggs; at five hens, we had a little under 200 eggs; six hens are at around the 200 mark; and at seven hens we're up here getting close to 350. Note that usually, when we're thinking about something like hens and eggs, we might think about the hens as the thing that is causing the eggs. So I can see a correlation here. Clearly these dots look like they're tending in a particular direction. And if I was to make a hypothesis about why that is, I would think there would be a cause-and-effect relationship between the independent factor, the hens, which are causing the eggs. Now note, you could think about that the other way around. If you're a farmer, I might say, I'm going to see how many hens I need to buy in order to generate so many eggs. But you could think about it the other way around. You might say, well, the eggs are causing the hens, and then the hens are causing the eggs, right? So you might try to buy the eggs first and think about them as the independent variable. But again, normally if you were a farmer in this case, you'd probably be buying the hens in order to produce the eggs, and therefore you would usually plot on the x-axis the independent variable of the hens. Note that if I reversed these, putting the hens over here on the y-axis and the eggs on the x-axis, we would still have a positive correlation. It's not like the graph would flip to negative, as you might think, when you flip the x and y. But by tradition, we usually put on the x-axis what we think is the independent variable. All right. A negative correlation: one variable increases while the other decreases. So in this case, we're talking about age and batting averages. The age of a baseball player is going up.
And as the age of the baseball player goes up, the batting average goes down. Now note that this correlation is pretty weak, but it is a downward-sloping correlation. Also note that, if I was making a hypothesis, I would probably put the age on the x-axis, indicating where we normally put the independent variable, which we hypothesize is causing the batting averages to go down. But if I reversed these two, we would still end up with a negative correlation. So whichever variable we put on either axis, if there's a negative correlation, it'll still be negative, and if it's a positive correlation, it'll still be positive. But by tradition, we'll typically put the independent variable, or what we think is the causal factor, on the x-axis. Okay, the correlation coefficient, which we usually represent with an r. Here's the formula for it: r = Σ[((xᵢ − x̄)/sₓ) × ((yᵢ − ȳ)/s_y)] / (n − 1). When we think about the formula, it looks complex, but given some of the sections we have seen in the past, it's not really that difficult. We have two different data sets now, and in prior sections we talked about the z-score. For one data set, that takes each point minus the mean, divided by the standard deviation. That's the z-score. For the second data set, we do the same thing: each point minus its mean, divided by its standard deviation for a sample. We multiply those z-scores together for each pair, add them up, and then divide by n minus one, where n is the count of data pairs that we have, with that minus-one factor we typically have for sample calculations that we've seen in prior presentations. So what's this mathematical calculation going to do? It's going to give us a range from negative one to plus one. So when we think about this correlation, we saw it pictorially in the prior presentation; we can also represent it mathematically with this calculation, giving us a result from negative one to positive one.
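The formula above can be sketched directly in code. Below is a minimal Python version of the sample correlation coefficient, using made-up hens-and-eggs numbers that roughly match the scatter described (the data values are assumptions for illustration). In Excel itself, the built-in CORREL function returns the same value.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient: sum of products of z-scores,
    divided by n - 1 (matching the sample formula described above)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample standard deviations (divide by n - 1)
    sx = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sy = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return sum(((x - mean_x) / sx) * ((y - mean_y) / sy)
               for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical hens-and-eggs data, roughly matching the scatter described
hens = [3, 5, 6, 7]
eggs = [100, 175, 200, 340]
r = pearson_r(hens, eggs)
print(round(r, 3))  # a strong positive correlation, close to +1
```

Note that pearson_r(hens, eggs) and pearson_r(eggs, hens) give the same value, matching the point that swapping the axes doesn't change the correlation.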
Now, if it was exactly negative one, which isn't likely to happen most of the time, but if it was in that extreme, it indicates a perfect negative correlation. We'll do an example of this just to show the extreme when we do our practice problems. An example would be, say, the distance traveled versus the distance remaining. So if you're going on a trip that is 100 miles and you traveled 20 miles, then the distance remaining would go from 100 down to 80, right? And if you had traveled 40 miles, then the distance remaining would be 60. And if you went 60 miles, the distance remaining would be 40. So you can see how you have that perfect negative kind of correlation. Now, in that case, you would again normally think of the distance traveled as the independent variable, but you could flip them and look at it from the distance remaining, and you would still get a downward-sloping line. But obviously in this case, if we were mapping this out, we would probably think the thing that we are doing is traveling, and that's causing the distance remaining to be dependent upon that. A positive one indicates a perfect positive correlation. So now we have them going up. In this case, we're comparing feet and inches. Any kind of unit conversion will have a perfect positive correlation. So obviously, one foot would be 12 inches, right? If we went up to two feet, we'd have exactly 24 inches. So this is just showing a conversion. Now on this one, note that you don't really know whether you should put the feet or the inches on the x or the y. Is one causing the other one? Not really. It's just a conversion, right? We're just measuring the lengths using different scales. So this is showing a correlation, but it's hard to say there is any cause-and-effect relation.
Now we're just defining the lengths differently, using different units. And then a zero means no linear correlation. So in this case, we have a bunch of data dots here, but when we draw a trend line between them, it's almost perfectly flat. A perfectly flat trend line would indicate that there's no correlation between them. Now remember that most data sets that we have are going to be somewhere in the middle. We're not going to see a perfect positive or perfect negative; we're going to see the dots trending positive or trending negative, and then we can draw a trend line between them. And most data sets won't have a perfect zero either, because even if the points were randomly chosen, you might have a little bit of correlation just from the randomness. So these are the extremes, negative one, one and zero, that we don't really expect to find in most of the things that we're going to apply this to. But we will do some examples of those extremes so that we can see what the borders look like. So here is something more like what we would actually see: heights and weights of individuals. In this case, if I was thinking about heights and weights of people, measured in inches and pounds, I would hypothesize that if someone is taller, that would tend towards a higher weight. That would be my hypothesis, right? I would think, if there's a cause-and-effect kind of relationship, taller people will tend to be heavier in general. And if you plot that out, we see that that does indeed look to be the case. All the dots do not fall exactly on the line, meaning I can't exactly predict what someone weighs from their height. But it is the case that if someone is taller, I would tend to think that they're going to weigh more than if they're shorter. But that's not always the case, right?
We could have somebody that's, if this is measured in inches, 63 inches, and 63 divided by 12 would be 5.25 feet, and they could be like 200 pounds or 180 pounds, right? That could happen, because you could have a very heavy, shorter individual. They would be kind of an outlier off the trend line, right? But the general line is going to go like this. So the height of someone is not going to give us a perfect estimate of the weight, but we can give a linear approximation of what that would be. That's typically what we're trying to do here. Ice cream sales and temperature would be another example, where you would plot the temperature going up against ice cream sales, and you would think that as the temperature goes up, you would have more ice cream sales. That might not always be the case, because you might have had a cold rainy day with, like, a festival next to your ice cream shop or something, and you sold a lot of ice cream even though it was cold. But in general, you would think that would be the case, right? Purpose of scatter plots: to show the relationship between two variables. So we're back to our hens and our eggs. If we plot these two things together, we can show the relationship; each point represents a pair of data. Now, obviously, intuitively, if I was a farmer, I would have a pretty decent sense that hens are causing the eggs, right? But if I plot them, then I get a better sense of exactly what that relationship is. And then I can start to make decisions like, how many hens would I need if I want so many eggs, by giving myself a linear kind of equation that I can put calculations into. So, identifying patterns. Linear patterns indicate potential correlation. This one's the height and weight again.
Now, with height and weight, you could be pretty confident making a hypothesis that there is going to be a cause-and-effect relationship. If someone is taller, they're going to have more mass. They're going to weigh more, typically, everything else equal. So if I plot that, you can see that pattern. Now, again, there could be cases where you're just combing through data and you see a pattern that's positively correlated like this, and there is absolutely no rational reason why it would be; it just happens randomly to be correlated. And that's what we have to be careful about with the idea that correlation equals causation. But the purpose of seeing the correlation is to try to then draw a conclusion as to whether there's a cause-and-effect relationship. And if there is a cause-and-effect relationship, try to nail down what the cause is, to the degree that we can, or how causal it is. No pattern might suggest no correlation. So in other words, if I plotted out these two data sets, whatever they might be, and I got this set of points, and then I tried to draw a trend line and I get no correlation, or possibly a very low correlation, then that's going to indicate to me that there isn't a cause-and-effect relationship. In other words, the point of doing the correlation is to try to see if there's a cause-and-effect relationship. And if we get a correlation, then it's quite likely that there might be a cause-and-effect relationship. Then we have to drill down and say, well, is there a cause-and-effect type of relationship? It's not necessarily the case. But if there is a cause-and-effect relationship, then you would think you would have to be able to find some kind of correlation. Whereas if you find a correlation, it doesn't necessarily mean there's a cause-and-effect relationship.
But if you find zero correlation, then you would think, at least with those two variables alone, that there's not a cause-and-effect relationship, right? Because if there was a cause-and-effect relationship, you should be able to find some kind of correlation. Now, you could have a more complex situation where maybe you need more variables; maybe if you look at it through multiple variables, there's some kind of relationship that happens. But again, if you have a low correlation, that would generally indicate, okay, there's not a cause-and-effect relation the way I have it laid out here. So why use regression? To make predictions based on the relationship between variables. Again, with the hens and the eggs, why do I do the regression? Well, if I just have these dots of data points, I'm not going to be able to answer a question like: how many hens do I need to buy in order to produce so many eggs that I'm going to sell in the future? But if I can draw a trend line, then I can make a general prediction, right? Like these first hens, for example: three hens made around 100 eggs in a year, I guess. And then five hens went up to like 175 or so. And then when I went up from five to six, that last hen was kind of a slacker. We got a slacker hen. Not that egg production isn't tough work; it's not a job I would want to do. But I noticed that the other hens made more eggs than this one, hen number six. But then going from six to seven, you had a high-producing hen over here. So I can look at the trend line and say, well, how many hens would I need to produce so many eggs, right? All right, so then we have a simple linear regression, using one independent variable to predict the value of a dependent variable.
And that's what we're going to focus on in our practice problems, where we have the independent variable, in this case the hens, which is going to give us predictive power over how many eggs are going to be produced in this example. So, residuals: the difference between actual and predicted values. The line represents our predicted values, the data points are the actual values, and the residuals are the differences. Our goal in a regression is to minimize these residuals. That's the least squares method. In other words, this line that we're putting between these data points is minimizing the squared differences between the predicted values and the actual values. So then, multiple regression. We can get more complicated, of course, when we have more than one independent variable. We might come to the conclusion that, hey, look, it's a more complex system than just these two factors; in some systems you can't come to the proper conclusion just looking at those two things. So multiple regression's advantage is that it provides a more accurate model by considering multiple factors. Example: predicting house prices using factors like size, location, age of the house, and so on. Notice, if you think about this scientifically, what do we try to do when we're trying to prove something? We try to remove all the variables. We try to go into our lab and say, I'm just going to put this one atom together with this other atom and see what happens, with everything else removed. And that gives us a really good cause-and-effect kind of relationship with high predictive power as to whether or not one thing is causing another thing. That's going to be like a one-to-one comparison. Any time we can do that, that would be great, because that's what we tend to try to do: isolate things to see a cause and effect.
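The least squares idea can be sketched with a short function: the slope is the covariance of x and y divided by the variance of x, and the intercept puts the line through the point of means. The hens-and-eggs numbers below are assumptions for illustration; in Excel, the SLOPE and INTERCEPT functions give the same fit.

```python
def least_squares(xs, ys):
    """Fit y = a + b*x by minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x); the intercept makes the
    # line pass through the point of means (mean_x, mean_y)
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

# Hypothetical hens-and-eggs data (assumed numbers for illustration)
hens = [3, 5, 6, 7]
eggs = [100, 175, 200, 340]
a, b = least_squares(hens, eggs)

predicted = [a + b * x for x in hens]
residuals = [y - p for y, p in zip(eggs, predicted)]
print(f"eggs ~= {a:.1f} + {b:.1f} * hens")
print("residuals:", [round(res, 1) for res in residuals])
```

A handy sanity check on any least-squares fit with an intercept is that the residuals sum to (approximately) zero.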
But obviously, systems in the real world get more complex. When we're looking at systems, the combination of things could produce results that you cannot get from the individual parts; the combination becomes, to some degree, a different result than the individual parts, right? So that means that for some systems, you would have to use multiple factors to see if there's a cause-and-effect relationship. Now, just realize that if you have a one-to-one comparison, it's much easier to measure whether there's a correlation than if you have multiple things going on. So the level of complexity will expand greatly once you go to multiple-regression type analysis, and your models will be a whole lot more complex. That's what you have to do. But again, being sure that your model is producing the right stuff becomes way harder as you introduce more factors. So, for example, when you're predicting house prices, you're going to use things like the size, the location, and the age of the house. Now obviously with a house, if you're trying to figure out the price of a house, how much you should sell it for or how much you should buy it for, we always hear the mantra: location, location, location, right? But that's really only one factor, because the size of the house is going to be a factor as well, and the age of the house is going to be a factor. So when you're trying to come up with a model to predict the price of a house, it's going to get quite complex quite quickly. And your model is probably not going to be perfect, due to the fact that every house is unique, right? Every house has its own location, its own age and size, and so on. So you can try to come up with models that give you predictive power, and you can come up with complex models based on multiple regression analysis.
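A multiple regression like the house-price example can be sketched with NumPy's least-squares solver. All of the house numbers below are made up for illustration (location is left out here since it isn't a simple numeric column); in Excel, the LINEST function can fit a multiple regression like this.

```python
import numpy as np

# Hypothetical house data: size (sq ft), age (years), price ($1000s).
# These numbers are invented for illustration only.
size  = np.array([1500, 2000, 1200, 1800, 2400])
age   = np.array([  10,    5,   30,   15,    2])
price = np.array([ 250,  340,  160,  280,  420])

# Design matrix: a column of ones (intercept) plus one column per predictor
X = np.column_stack([np.ones(len(size)), size, age])

# Least-squares fit: price ~= b0 + b1*size + b2*age
coeffs, *_ = np.linalg.lstsq(X, price, rcond=None)
b0, b1, b2 = coeffs
print(f"price ~= {b0:.1f} + {b1:.3f}*size + {b2:.2f}*age")
```

Each coefficient is the estimated effect of that factor with the others held constant, which is exactly the "more factors, more complexity" trade-off described above.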
But again, as you do that, obviously your models are going to get much more complex. Okay, so correlation does not equal causation. There's our common phrase, used often as a mantra. Why is this important? Misinterpreting correlation can lead to wrong conclusions and misguided actions. So it becomes quite important, because when people see a correlation, we want to, as human beings, determine that there's a cause-and-effect relationship. And if we do that and get it wrong, then we're going to be taking action on the wrong conclusions. So remember that if there is a cause-and-effect relationship, we should have a correlation, right? But if there is a correlation, it doesn't necessarily mean that there's a cause-and-effect relationship. The correlation is kind of the first step: a mathematical calculation to then try to give validation to a hypothesis that we might have had already, that there's a cause-and-effect relationship, right? But it's not the final factor. So, spurious correlations: relationships that seem meaningful but are actually due to coincidence or external factors. If you're looking at a whole bunch of data sets, it's possible that you will find correlations that just happen basically randomly, right? There's no reason that the things should be moving in alignment with each other, but they are. And when you find those and you realize that they're spurious, it's kind of funny, because then you can come up with scenarios: well, how could it be that these two things line up? Because obviously they don't. That's why it's funny.
So, the relationship between ice cream sales and shark attacks, for example. If you just looked at the data and saw that as ice cream sales go up, shark attacks go up, well, they're probably not causally related. Now, you could imagine a scenario where maybe they're somehow related in some way, because ice cream sales went up because it's hotter, and when it's hotter more people are at the beach. I don't know. You can try to figure out if there is some kind of link between the two. But the point here is that there might not be a link between the two. That might be a completely useless exercise, because maybe there is no cause-and-effect relationship; it just happened randomly that these two things that are totally unrelated came up correlated. I mean, you might be talking about ice cream sales somewhere in the middle of the country, not anywhere near the ocean, versus shark attacks, which you would think would only happen at the beach. How would that be related? It can't possibly be a cause-and-effect relationship, even though you might have this random correlation kind of thing. Or the number of people who drowned by falling into a pool and the number of films Nicolas Cage appeared in. Again, those things would be completely unrelated. But if you looked at those data sets, you might find pool drownings somehow correlated mathematically to the number of films that Nicolas Cage appeared in. I mean, does Nicolas Cage cause people to not swim well, or to want to jump in a pool? You know that doesn't make any sense, right? But again, if you just looked at enough data randomly, you will find correlations like that. So that's why the phrase comes into play, correlation does not necessarily equal causation, even though correlation is an important step to try to determine if there is causation.
By the way, the next question, of course, if we determined that there is causation, is to make sure that we have the causal factor correct, because the next question would be: what is the causal factor in the cause-and-effect relationship? The other common problem, and sometimes a manipulation that people make when they're being dishonest, is to reverse the cause-and-effect relationship between the data points. If you're acting on the idea that there's a mathematical relationship or correlation, and there's an assumption that that's due to causation, but you flipped what the causal factor is in that relationship, that could lead to the wrong action, right? You need to get the causal factor correct as well.