 statistics and excel perfect positive correlation got data let's get stuck into it with statistics and excel first a word from our sponsor yeah actually we're sponsoring ourselves on this one because apparently the merchandisers they don't want to be seen with us but but that's okay whatever because our merchandise is is better than their stupid stuff anyways like our trust me i'm an accountant product line yeah it's paramount that you let people know that you're an accountant because apparently we're among the only ones equipped with the number crunching skills to answer society's current deep complex and nuanced questions if you would like a commercial free experience consider subscribing to our website at accountinginstruction.com or accountinginstruction.thinkific.com you're not required to but if you have access to one note we're in the icon left hand side one note presentation 1710 perfect positive correlation tab we're also uploading transcripts to one note so that you can go into the view tab immersive reader tool change the language if you so choose be able to either read or listen to the transcript and multiple different languages using the time stamps to tie into the video presentations one note desktop version here now thinking about correlation where we have different data sets to see if there's a mathematical relationship or correlation between them in other words are the data points in the different data sets moving together in some way shape or form if there is a mathematical relationship or correlation between the different data sets the next logical question would be is there a cause and effect relationship that's causing the correlation or mathematical relationship and if there is a cause and effect relationship the next logical question would be what's the causal factor that's causing the cause and effect relationship which is causing the correlation or the mathematical relationship between the data sets we're going to start off by thinking about a perfect positive correlations something you don't typically find when looking at correlations because usually we're looking at two different data sets to see if there's a trend that is moving in the general direction but we want to start off with that perfect situation which one example would be when we had conversions so if you were to convert for example inches to feet then you're going to have the perfect correlation between those two different data sets so that's what we'll construct first to get an idea now to do that we're going to give i'm going to imagine that we're going to give excel the information to generate our our first data set how could we do that we could do that by just giving random numbers we can ask excel to give us just random numbers and we've seen in prior sections that if we were to do that give us random numbers that we're going to say our inches for example uh excel then if we were to graph that it would be going towards like a uniform type of distribution but for our first example i'm going to give our first data set in inches and we're going to generate the data that's going to be random but in alignment with a bell curve type of distribution if you want to see how to do that you can do this problem in excel so that we're going to have our inches data here this is the inches data that we generate in excel it's going to align more or less to a bell curve has some randomness but aligns to the bell curve again the fact that it aligns to a bell curve is not going to be the factor which will be determining whether or not there's a perfect positive correlation or not and that's one thing that we kind of want to take a look at we'll look at examples in the future where we will do a random example it won't correlate to a bell curve but rather correlate to more of a uniform distribution so uh then we're going to round it so i'm going to take this data and we're going to round it so here's our kind of randomly generated numbers i'm going to round it to the nearest inch so we're going to imagine our random generated numbers 16 inches 21 inches 36 inches 54 inches 52 inches and so on and then we'll convert that to feet so if there's 16 inches and we say there's 12 feet and an inch we're going to say all right 16 divided by 12 is about out 1.33 it should be and then we're going to say all right 21 divided by 12 21 over 12 is around 1.72 1.75 because of the rounding so i'm going to say there it is so there's going to be our feet information so i'm going to say okay so we have that and notice if i get more specific of course if i took this 15.60015 divided by 12 you get closer to that 1.3 so they're still rounding involved but we rounded it to the two digits on the inches now clearly we can see here because this is a conversion that there's going to be a relationship between these two data sets because of the converting factor of 12 inches in a foot but if we imagine that these two we don't know exactly what they are and we're trying to see if there's a correlation then we can imagine that from this point going forward so now we're going to say okay then and by the way we generated this random set of numbers by going to the data if you were in excel and using the analysis which you can turn on in excel if you don't have it and the data analysis tool and then we chose the random number generation but we set some parameters around it for giving us the bell type curve generation which is here so we set some parameters for a normal distribution and that's how we generated our first set of numbers okay so then if we calculate say the mean and the standard deviation so we're taking our generated numbers over here and inches and feet and picking up the mean and the standard deviation so let's see if i can do this i'm gonna open this up and then we've got our mean and standard deviation in inches the mean is simply the average so we would just be taking all of this data right here and then divided by the number or the count and then the standard and that's going to be this formula if you were to do that in excel standard deviation we're taking standard deviation of the sample standard div dot s and that's going to be all of this data in this formula and then we do the same thing for the feet the mean of the feet here average standard deviation of the feet of this data set we get to here now of course if i was to compare this data if i didn't know that there was a relationship and i look at those two i can take the difference between the mean and the standard deviation it's not going to tell me a whole lot right it's not going to tell me if there's really a correlation between them so the next thing we can do is we could say well what if we were to plot out the inches now note these are the inches data so if i took all of this data and made a histogram of it it looks like this so you can see that in this middle point you've got around 33 the top bit of it and that's around the mean that we generated the data set with and you can see that it looks somewhat like a bell shaped distribution with a lot of the data in the middle tapering off to either side that's because we put the information in there to give us a randomly generated numbers that would be around a bell shaped distribution giving excel the middle point the mean and the standard deviation to work with now if i did the same thing for the feet i can see i don't have the same buckets because now that middle the middle mean point is here but it looks similar so the fact that these two look kind of like a bell curve distribution does not necessarily mean that there's a correlation between them but that in and of itself if we plot these out themselves might give us some idea depending on the data set we're looking at of what is going on so that's just so we'll say hmm they both look like they've got like a bell shaped curve you know that is interesting and then we can then say what if I plot this out now you in order to look at the correlation you don't necessarily have to plot out if it was a bell curve but i want to take a look at the plotting it of a bell curve so that we can look at the bell curve data as well as the actual data in terms of the correlation so if we were to plot this out in terms of a perfect bell curve like we saw in prior sections we need the x and we need the z i'm sorry we need the x first and we need the lower point and the upper point so if we're thinking about inches you might say well i can go from zero on up to infinite amount of inches if i wanted to plot out say a normal distribution but it would be better it might be easier if i look at the range in terms of we usually use the four standard deviations because that would include a large amount of the range so if i was to calculate that i can get to the lower x and then the upper x so how can i do that i can say well the standard deviation is going to be if i look at over here the mean for the inches is 34 and the standard deviation is 16 so if i take 16.29 times four four standard deviations minus the 34.92 that's where i get this negative 30.24 and on the upper side i can go from 16.29 times four plus the 34 34.92 and i can plot out my x's from 30 around to 100 or so and i should have enough information to plot a graph and then i can do the same for the feet so on the feet side of things i could say okay the the mean is uh is that and the standard deviation is 1.36 times four four standard deviations minus the 2.91 that's going to give us our 2.53 about is going to be our lower point and then i can say and it's negative and then i can say okay let's take the 1.36 times four standard deviations above the mean plus 2.91 gives us that 8.34 okay so then if i was to plot this what do we have here that's my calculation so if i plot this i have my x's for inches at negative 30 going all the way up to if i have all the data pointed here 100 and then i can i can do my perfect distribution using my norm dot dist calculation which looks like this norm dot dist of the x this number the mean over here the standard deviation uh the standard deviation and the mean are over here and that's going to give us then our p of x so this is our p of x for the normal distribution and then i can calculate my z and the z is calculated as taking each number let's take a positive number it might be easier to calculate let's go let's do one down here so that we could see it a little bit more easily i can take like a positive number here and i'm going to say that's that one this one minus the mean and the mean is this 34.92 about divided by the standard deviation of 16.29 and that gives us the z of the 2.08 about and that z score becomes important when we're trying to be comparing two different data sets so so the z is kind of what we're focused in on here so there's the z and i can do the same thing for the for the feet so on the feet i have the feet from negative three up to eight and then i do the norm dot dist for the feet calculation norm dot dist the x the mean the standard deviation and it's not going to be cumulative and that'll give us our percentages and then we can take the z score for the feet let's look at this three feet for example i could take the three minus the mean which is over here of the 2.91 divided by the 1.36 and that gives us our 0.67 about so so there we have that and then okay so let's take a look at that in a second now just looking at this intuitively before we actually do the calculation for the correlation note if i was to look at these two data sets and kind of compare the z scores let's go down to this the positive numbers are easier to see if i look at this number 12 then i can't really compare that number 12 inches to the feet if i didn't know the conversion but what i could do is say well the z score is at the 1.41 and if i go over here and i look at the same uh z score 1.41 that correlates to one foot so i can compare these two right and i can do the next one i can say okay well what about if i go back up to here 24 has a z score of 0.67 right and i go all right over here if i look at 24 i have 24 is at the 0.67 z score i have a 2 so i can so you can see the pattern here of course by looking at the z score if i didn't know that there was that what the pattern was i can match up all the z scores and i can start to see you know what the difference is the conversion of you know the 12 feet to the inches so i can so i can then say okay i these are the z scores these are these uh two and i can look at the differences between the one and the 12 and the two and uh the 24 and i can start to see a pattern there and that's basically the z scores are basically a big part of what we're doing with the correlation so when we think about the correlation here's the formula for it which looks pretty kind of intimidating however what we're doing is we're taking the two z scores which is just what we did here we're taking each data point minus the mean divided by the standard deviation the z score for one of the two items minus the z score for the other multiple i'm sorry times the z score for the other multiplying the two z scores together we're doing that for each data point summing it up for each of the data points and the data sets and then we're dividing by n minus one because it's similar to like a sample situation so the n minus one which is the count of the data sets minus one so if i was to plot this then we can plot this out using a a excel i can just take our actual data sets let's go back on over to the actual data and if i was to just highlight these two columns here and insert a graph on it which is going to be the scatter then i would come up with i can come i would come up with something that looks somewhat like this where we have this perfectly positive correlation now when i highlight something in excel and input it the one on the left which in this case was the inches will automatically be the x and then the y is going to be the feet so you can see it's a perfectly positive correlation because as we increase the inches the feet increase in proportion and we have that perfectly sloping line and that's why we have the perfect positive correlation now note that you could because we're talking about a conversion here you might say okay well i'm not sure like what's the causal factor we might not always know like what should go on the x or the y you can look at it either way because you can see the feet over here as the feet goes up the inches go up so what is the driving factor you may not always know what the driving factor is and here's just a conversion so you can kind of flip it either way and see which one is going to be thought of as the driving factor but usually usually if we know we would put the independent variable on the x and then the dependent on the y if you were to flip this and say i want to flip it the other way around now i have the inches on the y and the feet on the x so you can do that in excel by just flipping that around so it's a little bit technical to kind of do that we'll do that in excel if you want to check that out you might imagine by doing that that you would go from a positive correlation to a negative correlation meaning you might think well it's going to be an upward sloping line and if i flip these two it's going to be downward sloping no it's still going to be an upward sloping line we still see it's going to be positively correlated so you know you could put either of these on on the axes and you're still going to end up with that basically the positive correlation okay so let's go and calculate this using this formula now and say let's calculate our correlation so to do that we're going to get the z and inches and and that'll be our first thing so we've got the this is going to be our inches now we want to get the z and inches now this is our actual data we are now not talking about the the perfect distribution over here we did this perfect bell curve distribution to compare the z's in a perfect distribution kind of situation now we're looking at our actual data same idea though we're going to be taking the z score of our actual data the first one being 16 it would just be the 16 minus the mean and the mean is over here at the 34 so minus 34.92 divided by the standard deviation that's going to be divided by 16.29 is the negative 1.6 uh hold on i did something i think i did something wrong let's do that again 16 minus the mean which is going to be minus 34.92 divided by standard d 16.29 1.16 so i maybe it must be rounding that is the difference here so yeah i think it's going to be a rounding difference because we rounded these inches down so but in any case there's that we'll do this in excel if you want to check it out there as well and then we do this for all of them we do that for the feet as well with our actual data same calculation we saw before but with the data uh i'm sorry there's the feet and then the z score would be taking each of these data points for the feet and giving us the z same kind of calculation and then this is multiplying out the two z scores of the two different data sets so now this is going to be the 1.19 times the 1.19 has given us the 1.41 so at this point then we've taken each of the data sets each of these points for each data set minus its respective mean and divided it by the standard deviation so now to get the numerator all we have to do is sum up this last column and we'll get this whole numerator so that's what we're doing over here we sum up the column we get to 309 that's this last column that's our numerator then we're going to be subtracting out or dividing the denominator which is n minus 1 n represents the count all the all the numbers in our data set there's 310 in this case and then less 1 less 1 minus 1 gives us the 309 here and then we got the numerator and denominator and if we divide out 309 divided by 309 we get to 1 which of course indicates that we have a perfectly positive correlation as we can see here with that perfect positive correlation so let's take a look at a way we can do that easier you can also do this with the data analysis toolkit and you can double check your numbers now notice that when you use the toolkit it's great because it'll give you it'll give you it'll kind of spit out the number but it's not dynamic because when you build something in excel you have the formulas and if your data change then you can change the cells with it so and it's also nice to actually map it out sometimes like this because that gives you an idea of the z-scores it'll give you a better understanding of the data so so oftentimes you might use the data analysis depending on what you're doing to give a precursor information before building a more comprehensive worksheet depending on whatever you're doing or in our case like we're doing here kind of like as a double check so if I was to do the data analysis and I go into this correlation in the data analysis then I enter the information here what we need is the input range that would be all of our data we have to have the data together to use this so this would be the the original data that we did over here these two these two columns of data that would be the range we put into the data analysis and then I would have to click this off if I chose the labels and then we tell excel where we want to output it and that's all we need to do and excel will give us this little table this one right here which is saying inches rounded and and feet so inches here feet here we have the one which represents the perfect correlation so the ones representing a perfect correlation so then I can also use the data analysis tool pack to spit out this kind of data as well which is going to be the descriptive statistics and so this gives us another look at all the data so all I would do here is select all of this data these two columns and then tell excel using this the analysis toolkit to give me the descriptive statistics again excel doesn't give me something that's dynamic meaning these cells won't change if we change if the data set changes but this is gives you some nice information to double check or to give a preliminary look at the data you're looking at which might help you to get an understanding or idea of what's going on so you can see the mean versus the mean the standard error the median the mode the standard deviation the sample variance not all of these we might talk more about them later but the kurtosis the skewness the range here's the maximum and the men and summing them up here's that count that we came to the 310 and the confidence level isn't something we're really focused on at this time but I just want to point out that tool because it could be a great tool to look at so so quick recap when we have two different data sets we might want to see if there's a mathematical relationship or correlation between them or the data points moving together in some way shape or form some tools we have to do that is a mathematical formula we can also see it pictorially by mapping out the information and usually we will not have this perfect correlation but we might see a trend in the data we can then draw the trend line which can help us to visualize whether or not there might be a relationship between the data sets now note in this particular data set we wanted to map out the original information randomly but using a normal distribution which means that each of our data sets have this kind of distribution that look like this this normal distribution that fact that it has a normal distribution remember doesn't necessarily mean that there's going to either be or not be a correlation to it but that allowed us to then think about a situation where we can map out the actual perfect or smooth bell curve using the norm dot dist function and then compare those z scores the z scores are kind of the key to the calculation those are the things that allow us to do the comparison as we can see in the formula here now in future presentations we'll do some more data sets which will not conform to say a bell curve for example and we could still have situations where there's a correlation between them