Statistics and Excel. Correlation and Regression. Got data? Let's get stuck into it with statistics and Excel.

First question: what is correlation? Correlation measures the strength and direction of the linear relationship between two variables. And whenever we think about correlation, we have to keep in mind the common phrase that correlation does not necessarily mean causation. However, when a phrase becomes as common as this one, "correlation does not equal causation," it often loses some meaning. People often say it as a kind of mantra without really thinking deeply about what the mantra originally meant. So we'll come back to this phrase here, and possibly multiple times through the presentation.

But first, a quick recap of what we have done in prior sections. In prior sections, we tried to describe different data sets using both mathematical descriptions, such as the mean or average, the median, the mode, and the quartiles, and pictorial representations, like a box and whiskers plot or a histogram. Remember that the histogram is one of our primary tools to visualize the spread of the data, and we're able to use descriptive terms about the histogram to describe that spread, such as the data is skewed to the left or the data is skewed to the right. We then thought about certain types of data sets that might approximately follow a line or curve defined by a mathematical function, like a uniform distribution, a Poisson distribution, an exponential distribution, and a bell curve or normal distribution. And if we can describe the data with a line that has a function behind it, that gives us more predictive power.

Now we're thinking about two data sets, or possibly more than two in some circumstances, but we're starting off with two data sets to see if there's some kind of relationship between them, first thinking about that relationship as a mathematical relationship.
In other words, are the dots in the different data sets moving together in some way, shape or form? And this is where we get into the differences between correlation and causation. The reason this is important is that, with the phrase "correlation does not necessarily equal causation," people might react in ways that aren't exactly correct.

One reaction might be, well, that's so common that I'm just going to dismiss it, I'm not even going to think about it, and I'm going to revert back to what I normally do, what is natural for us to do. If the data points are correlated, we as human beings are hardwired to think that there's a cause and effect relationship. That's why the phrase came up in the first place, to say, hey, no, if you just comb through data, you're going to find correlations that have a mathematical relationship but aren't cause and effect related. And if you know that, those things can be kind of funny, because you go, hey, look, this is mathematically related and it obviously has no cause and effect relationship.

But the other conclusion people might come to is that the correlation is meaningless, because correlation does not equal causation. So what's the point of doing the calculation if correlation doesn't equal causation? After all, what we're usually trying to do with the mathematical calculation of a correlation is find a cause and effect relationship. So note that if there is a correlation, it doesn't necessarily mean that there's a cause and effect relationship. However, if there is a cause and effect relationship, you would think we would be able to find a correlation. So in other words, the correlation is still important because it might lead us to the question of, is there a cause and effect relationship?
It's more likely that there is a cause and effect relationship if there is a correlation, but we have to be careful, because there's not always a cause and effect relationship. So the general idea would be, when we're thinking about correlation, we're looking at different data sets to see if there's a mathematical relation between them. And if there is a mathematical relation or correlation between the data sets, the next logical question would be, is there a cause and effect relationship between the data sets? And if we determine that there is a cause and effect relationship, the next logical question would be, what's the causal factor that is producing the correlation we calculated between the different data sets?

Okay, so types of correlations. We have the positive correlation, where both variables increase together. In this example, we have hens and we have eggs. So these are our two different data sets. We have how many hens we have and how many eggs they are producing. I think this is like in a year, for example. So in this case, when we plot this out, we have on the X the number of hens. So at three hens, we had around 100 eggs. At five hens, we had a little under 200 eggs. Six hens are at the 200. At seven hens, we have up here getting close to the 350.

Note here that usually when we're thinking about something like hens and eggs, we might think about the hens as the thing that is causing the eggs. So I can see a correlation here. Clearly, these dots look like they're trending in a particular direction. And if I were to make a hypothesis about why that is, I would think there would be a cause and effect relationship, with the independent factor, the hens, causing the eggs. Now note, you could think about that the other way around.
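To put a number on how tightly the hens and eggs dots move together, here is a minimal sketch in Python of the usual Pearson correlation calculation. The four data points are assumptions, values roughly matching the chart described above, not actual data, and the function name `pearson_r` is just an illustrative choice.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Assumed values, read roughly off the chart described above; not real data.
hens = [3, 5, 6, 7]
eggs = [100, 190, 200, 340]

r = pearson_r(hens, eggs)
# The sign gives the direction; a value close to +1 means a strong positive relationship.
print(f"r = {r:.3f}")
```

The result always falls between -1 and +1. In Excel, the built-in CORREL function calculates the same quantity from two ranges.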
If you're a farmer, in other words, I might say I'm going to see how many hens I need to buy in order to generate so many eggs. But you could think about it the other way around. You might say, well, the eggs are causing the hens and then the hens are causing the eggs, right? So you might try to buy the eggs first and think about them as the independent variable. But again, normally if you were a farmer in this case, you'd probably be buying the hens in order to produce the eggs. And therefore you would usually plot the independent variable, the hens, on the x-axis. Note that if I reversed these, putting the hens over here on the y and the eggs on the x, we would still have a positive correlation. It's not like the graph would flip, as you might think, if you swap the x and y. But by tradition, we usually put on the x what we think is the independent variable. All right, a negative.