 Statistics and Excel correlation large data sets focus on Z-score relationship. Got data? Let's get stuck into it with statistics and Excel. First, a word from our sponsor. Yeah, actually we're sponsoring ourselves on this one because apparently the merchandisers, they don't want to be seen with us. But that's okay whatever because our merchandise is better than their stupid stuff anyways. Like our crunching numbers is my cardio product line. Now, I'm not saying that subscribing to this channel, crunching numbers with us, will make you thin, fit and healthy or anything. However, it does seem like it worked for her, just saying. So, subscribe, hit the bell thing and buy some merchandise so you can make the world a better place by sharing your accounting instruction exercise routine. If you would like a commercial free experience, consider subscribing to our website at accountinginstruction.com or accountinginstruction.thinkific.com. You're not required to but if you have access to OneNote, we're in the icon. Left hand side OneNote presentation 1750 correlation large data set focus on Z-score relationship tab. We're also uploading transcripts to OneNote so you can go to the View tab, Immersive Reader Tool, change the language if you so choose. Be able to either read or listen to the transcript in multiple languages using the timestamps to tie in to the video presentations. OneNote desktop version here thinking about correlation, having different data sets to see if there's a mathematical relation or correlation between them. In other words, are the data dots and the different data sets moving together in some way, shape or form. Now, if there is a mathematical relation or correlation between the different data sets, the next logical question would be, is there a cause and effect relationship cause in the correlation or mathematical relation between the different data sets? And if there is a cause and effect relationship between the different data sets, the next logical question would be, what's the causal factor that's causing the cause and effect relationship which is causing the correlation or mathematical relation between the different data sets? So, in prior presentations, we thought about perfect correlations in a positive direction as well as perfect negative correlations, not things that we see oftentimes in practice because we're usually looking at trends with the correlation. We also noted that correlation like any other statistic is something that we want to have as one type of tool when we look at our data, be able to look at the data in other words from different angles as well. This time, we're going to have some data sets that are going to be a little bit longer and we'll do some added focus in on the Z score relationship, which is going to be a primary component of the calculation of correlation. Here's going to be our data sets, the height measured in inches and weight measured in pounds. Now, our data set of course here, we didn't include the entire data set. If you want to look at it in Excel, we'll have the entire data set there, but it's got more data in it than we have seen in some of our prior presentations. Noting that when we first look at these, of course, like with any data set, we might first come up with our assumptions as to what might be the case. So if I'm looking at height and weight, then I might say, well, these have to do with nature, for example, so I think maybe there's a bell curve related to these items. And I also might then think, well, if these things might be correlated, because of course I would think that the independent factor, the driving factor, would be the height, which might cause the weight to be higher, because if you're taller than you would have more weight, might be some assumptions that you might make some theories about the data going into it. So in any case, there's going to, then we're going to have our data on the right-hand side. Now first, let's do a histogram. So if I did a histogram of the heights, just selecting the heights and entering a graph and Excel a histogram, then we can see that in the buckets in the middle, 68 to 68.33, 68.1 to 68.33 seems to be the middle point. It seems to taper off looking very bell shaped as we would kind of expect with something that's nature related, measuring things like lengths of animals or humans or heights or weights and that kind of thing. If we did the same thing for the weights, we have a similar result in that we have different buckets, of course, down here corresponding to the weights, but it kind of looks like a bell type curve, which is what we might expect when dealing with weights. These being measured, of course, independent, noting that as we've seen in the past, the fact that they both look like a bell curve doesn't necessarily mean that there's a correlation between them. So we've seen examples in the past where our data sets seem to be going towards more like a uniform distribution versus a bell curve distribution and so on. And it doesn't necessarily mean, no matter what the distributions are, that they're going to be correlated, but it might give us some other insights about the data which might further strengthen our hypotheses as to whether they be correlated or not. So then if we go on over and say, let's do our calculations for the correlation, which of course looks like this, we're going to take each of our data points minus the mean divided by the standard deviation, multiply together, sum them up, divided by n minus 1. Now with this data set doing this manually is very tedious because we have quite long data sets here. But with Excel, not too bad, and we can also use the tool which we'll do in a second here to analyze this more quickly with the Excel's data analysis tool. So if we take the mean of the two, this is just simply taking the average of this data set, the heights, and then the weights data set. We get these two numbers, 67.99 for the height and 127.08 pounds for the weight. The standard deviation, if I take just simply these two data sets and take the standard deviation of a sample for them, I get the 1.9, the 11.66, this measuring the spread. Now just looking at those data points, it's not going to really help me most likely to see if there's a correlation between them. But that'll help us with our correlation calculation, obviously. So here's our average calculation. Here's our standard deviation calculation in Excel. So then I'm just going to map out our data sets. Here's the height, and we can just take each of these data points and calculate the z score now. So this is the top part of our formula. We're going to take each point starting with 65.78 minus the mean 67.99 divided by the standard d 1.9. That's going to give us the 1.16 negative about. We can do this all the way down. The next data point would of course be 71.52 minus the mean of 67.99 divided by the standard d 1.9. And that's going to give us 1.85 about. We can do that all the way down. We can do that all day, but we don't have to do it all day because we did it in Excel, which means it took a very small part of the day, which is nice. That's what Excel is for. That's what it does for us here. So then we can do that with the weight as well. We can say, okay, here's the data point, 1.12.99, calculate the z, subtracting minus the middle point, 127.08, or the mean or average in other words, and divide it by the standard d, which is the 11.66 to see what it will be. We get the 1.21. We can do that all the way down, of course. Let's do one more for the fun of it because I'm having a good time with this. 136.49 minus the mean, 127.08, divided by the standard d, 11.66. And that's going to give us the .81 about. And we do that all the way down. Once again, we let Excel do it all the way down. Otherwise, we would go crazy as people must have done before Excel. Everybody before Excel must have been crazy. They're all crazy people. So if we multiply these together, we get 1.4. And then if we multiply these together, the z's are 1.85 times .81. That's going to give us the 1.49. So now we're just multiplying together each of them, the two z's, and that'll give us all the z's. Now we just sum that up and it gives us the numerator of our calculation for the correlation. Let's do that over here. We're just going to sum up this outer column that comes out to 12,570.96. The denominator is going to be n minus 1. Now this is how many rows were in this data set. So that's telling you how long the data set was. It's 25,000 points on the data set, which for us is a lot. That's more than we've seen before. And then we're going to subtract 1 from it, which seems awfully insignificant, but it's not because it's our correlation calculation. That gives us the 24,999. And then we have the numerator and the denominator. If we divide the two out like we're supposed to do when you have the numerator and denominator, you take the top 1,1,2,570.96 and divide it by the bottom one, the denominator, 24,999. And that gives you the .5 correlation. So we have a correlation. Of course, it's not a perfect correlation, but we've got the .5 correlation. Now we can double check that in Excel by going to our data tab, going into our analysis, using the data analysis. If you didn't have that in Excel, which it's not on by default, you can turn it on by going to the options. We show you how to do that in the Excel practice problem. But then you can just click on the correlation. You can then go to this correlation, select your data ranges, which would be the two data sets that we looked at for the weight and height. They have to be together, like this, side by side in order to select them here. So you just pick up the ranges, put it somewhere on the output on where you want it to go, and it'll spit out the answer. So you've got the height and the weight, .5. So it's not a dynamic calculation. It doesn't move when the data set changes, but .5029 gives us your double check or a preliminary look at your data. All right, let's do a little bit more analysis on... The other thing we could do is you could use our data analysis and you could use the descriptive statistics, which will spit out the descriptive statistics as well. So you'd put in the range again, and then I selected the summary statistics and the confidence levels, and it'll spit out your statistics like this. So for our two data sets, we have the height and the weight. Now again here, these are not dynamic, meaning they're not formulas. They will not change as you change the data. But this is a great tool if you have your data and you want to give a preliminary analysis and let it just spit out some information about it, which might help you to then do whatever you want to do, like construct your workbook or use it as a double check to check that your calculations are correct in here. So it gives you the mean, the standard error, the median, the mode, the standard deviation sample variance, kurtosis skewness range, minimum maximum sum count, there's that 25,000 again, and the confidence level. All right, let's now focus a little bit more on the Z scores again. So here's our data sets again. We have the height and the weight, and we will note that when we think about the calculation for the Zs for the correlation, let's go back over here and look at it. Where did it go? Donde. These two are the Z scores. So really we're comparing the Z scores. That's the key component that helps us to compare the data sets and do this correlation calculation. So just to get more of an intuitive sense on what's going on, let's kind of focus in on those Z scores a bit more. So I'm going to say, all right, here we've got the mean again. Here's our mean, our standard deviation of our data sets. And now I'm actually going to plot this out because I think it's easier to kind of focus in on the Z scores when we have the perfect bell curved shape. So we noted that both the height and the weight had a bell type shape. So now I'm going to make a curve, which will be the perfect curve that will simulate the bell shape. We talked about this in prior presentations and then we'll focus in on the Z scores from those. Now remember, you don't need to have a bell shape kind of set of data in order to do the correlation calculation. Not all data sets are going to conform to a bell curve and things that aren't a bell curve could still have a correlation between them. But we noted that these two happen to have a bell curve and I think the actual curve will help us to focus in on the Z scores. So to calculate the bell, the curve, what I'm going to do is say we're going to take it for standard deviations and plot this out. So we'll have the X or the low and the high for standard deviations. So what does that mean? That means on the low H, we're going to say that the standard deviation was 1.9 times 4 and that's going to get us up to 7.6. I'm going to add that to or subtract it to to get the lower, the 67.99. So we're going to start at the low point at 60.39. And then on the high point, I'm going to say it's going to be 1.9 times 4 plus the middle point, 67.99. So 70.6. So when I look at the height, I'm not going to go down to 0 inches up to like a thousand inches. That would be, you know, too much, too much because all of the data is going to be between for the most part, not all of it, but for practical purposes, all of it is going to be between these two points. And then here, let's do the same here. The standard deviation is 11.66 times 4 minus the middle point 127.08. That gives us the low point, negative, and you're not going to have negative weights, but we might start... Well, no, it's not negative. It's not negative. It's going to be the 80 is the low point. And then the high point, we're going to say it's negative the way I calculated it, but it's not negative because it's going to be that minus the standard deviation times 4. Okay, so then let's do it. The other 11.66 times 4 and then plus the 127.08 and that's going to give us the 173 on the upper point for the weight in pounds. All right, so if we plot out the bell curve, which we focused on in a different section, but we'll plot that out here, so we'll do it. You've got each data point and then the bell curve is going to be the norm.dist function where we take the x, which is this, comma, the standard deviation for the height is that and then the... Is it going to be cumulative or not? It's not going to be cumulative for this calculation and that'll give us, for each height that we picked from 60, I started at 60, the low point, up to 76, I went to, up to 76, it gives us our percentage, the percent likelihood that the data will be there. And so there we have it and then we can calculate the z. So the z scores then, same kind of calculation for the z score, I'm going to take each point, in this case the 60, 60 minus the middle point, which is going to be the 60.39 divided by the standard... Sorry, that's not the middle. Let me do that again. 60 minus the middle point, which is 67.99, divided by the standard deviation, 1.9 means that it's over four standard deviations away, which makes sense because four standard deviations away, as we just calculated, would be the 60.39 and we're at 60, so we're over four standard deviations away. That's what the z score is telling us. And we can do that same calculation for the z score all the way down. It's going to be in order and it flips to be positive at the mean. So an exact zero would be at 67.99, so we don't have 0.99 here, but it's close to zero right there, right? So that's going to be that one and then we can do the same thing for the weight where we take each point and we calculate the norm dot dist, this should be having an in front of it, norm dot dist. And that's going to be the X 79, the mean which is going to be the mean of this data set 127.08 and then it's going to be and then the standard deviation is going to be the 11.66 and then it's not cumulative. And then we can calculate the z scores in the same fashion. Do that all the way down z scores would be this one of course around four away because we're looking at that top one which is 79 minus the 127 127.08 divided by the standard deviation of 11.66 gives us four standard deviations away over that because we went to 79. So there we have it and we can do that all the way down. Those are going to be our z scores. Now of course when I look at the data sets here I can't really compare the height versus the weight because they're different units. What we can compare then is I can see if there's a relationship by looking through the z scores and see if there's a relationship as we compare the z scores. So let's do that over here. So what I'm going to do is I'm going to take my data I'm going to take the one that is going to be this one first with the weights because there's more data points here and so I'm going to put that one on the left this time and then I'm going to put the heights on the right so here's the data set my headers didn't pull over for some reason but here's my height data sets with the 60 and and so on I'm sorry this is still the weight so I'm copying this over but I'm just putting the weight first so this is the same thing as the weight here with the weight the normal distribution and the z score and I'm going to compare that to the height so here's the height data and here's the the norm dot dist and there's our z score for the height okay so now what I did is I tried to match up the z scores so that 4.20 is close to the 4.12 and then the next z score if I go down to the next z score it's at the 3.68 which is close to the 3.87 see what I'm doing here I'm trying to match up the z scores and the next z score is at 3.15 and I tried to match that up I think when I copied this over it got staggered a little bit but I tried to match that up to the relative 3.15 on the left which is actually down here so we've got something like this one so it actually lines up to this one and then I'm taking the difference so I'm saying this is close to that z score let's take the difference between this 90 to 62 and I get a difference of 28 and on this one this z score of 3.68 is close to like 3.69 and so I took the difference of this 84 that's related to this z score 3.69 is 84 minus this one of 61 and that gives me the 23 and then I went down and said okay this z score is close to this is the 2.63 it's getting quite staggered here but 2. it's close to this 2.67 so then I took the 2.67 has the related 96 to it minus the 63 and we have a difference of the 33 so all I'm doing is matching up the z scores as best I can between the 2 data sets and then looking at the difference between the actual measures which are measured in pounds for weights and in inches for the heights and so to see if there's a relationship between the z score and the difference between those intervals see if it was just a direct conversion as we saw with the perfect correlation type of thing the distance, the difference between inches and feet would always be 12 a multiple of 12 so we'd see a trend when we start to be comparing these so I did that all the way down here and then I squished them together so now they're on one so these are all of my differences that I found for related z scores and then if we plot that out this is the plotting actually of the bell curve for the height this is the bell curve for the weight and this is me plotting out this list of numbers which you can see there's a relationship between the numbers you can see kind of a trend from the z scores so the idea here would be that is to get kind of an intuitive sense of the idea that it's that z score which is the thing that helps us to do this comparison and it's the z scores which are a huge part of the calculation for the correlation as we can see over here so it's useful then to sometimes to focus in on those z scores a little bit more as a way that possibly can give you some better understanding of what's actually going on within the data sets