 Statistics and Excel. Correlation random number generation example. Got data? Let's get stuck into it with statistics and Excel. First, a word from our sponsor. Yeah, actually, we're sponsoring ourselves on this one because apparently the merchandisers, they don't want to be seen with us. But that's okay, whatever, because our merchandise is better than their stupid stuff anyways. Like our, trust me, I'm an accountant product line. Yeah, it's paramount that you let people know that you're an accountant. Because apparently we're among the only ones equipped with the number crunching skills to answer society's current deep, complex and nuanced questions. If you would like a commercial free experience, consider subscribing to our website at accountinginstruction.com or accountinginstruction.thinkific.com. You're not required to, but if you have access to OneNote, we're in the icon left-hand side, OneNote Presentation 1730 Correlation Random Number Generation Example tab. We're also uploading transcripts to OneNote so that you can go to the View tab, Immersive Reader Tool, change the language if you so choose, be able to then either listen to or read the transcript in multiple languages using the timestamps to tie in to the video presentations. OneNote desktop version here thinking about correlation, having different data sets to see whether or not there's a mathematical relation or correlation between them. In other words, are the data points and the different data sets moving together in some way, shape or form. If there is a mathematical relation or correlation between the different data sets, the next logical question would be, is there a cause and effect relationship which is causing the correlation or mathematical relation between the different data sets? And if there is a causal relation between the different data sets, the next logical question would be, what's the causal factor and the causal relationship which is causing the correlation or mathematical relation between the different data sets? Now in prior presentations, we thought about a perfect positive and a perfect negative correlation which are great to consider in theory, but which are not usually the examples we have in practice where we don't have a perfect correlation, we usually have some kind of trend. We saw that in another example where we had a very simple trend of only four data points so we can really analyze the formula that we're putting together in a simple, small or low data point example. This time we'll have more data points but we'll actually generate the data points to get a conceptual understanding of how we're generating the data points and then what we might assume would happen therefore with the correlation and then we'll map that out. We'll also take a look at what it means to have randomness to some degree and what randomness kind of looks like as we go through here. So we're just going to imagine we generate and excel our data, random data one with this formula, random between, and then we're just picking the low number, the bottom number, one to a hundred. Excel generating random numbers then between one and a hundred. We're going to do that two times over so we then created a second data set, same fashion to do so and that's going to be the data that we will be using. Now when we do this in Excel, Excel will keep on regenerating these random numbers so we're going to imagine that we copy this information over, paste it over here so now we have static numbers which will not keep shuffling around that are randomly generated. Now if I was to just think about how we created those numbers we can make some assumptions based on what we've done in the past. We can say well first I've randomly generated numbers not in accordance to a normal distribution or Poisson distribution or anything like that but more towards like a uniform distribution because we had random numbers which could equally come up between a certain interval between zero or one and a hundred. We also know that we generated these two data sets the same way so they're kind of related in that way. They're going to be numbers between one and a hundred but they're not connected in any other way in terms of how we created the two data sets. So we might have a hypothesis then that they wouldn't be highly correlated between them right because they're not exactly connected. So we'll kind of test those out as we go. Now first I want to look at it pictorially. Let's say we took just this first data set and we made a histogram of it. We counted all of the numbers here and see how many fall into the buckets of one to eighteen, eighteen to thirty-five and so on and so forth and you can see that it's not a bell curve type of shape or anything like that. If we did this indefinitely it would tend towards a uniform distribution or a straight line type of distribution. If we did this indefinitely we would expect kind of an even outcome because they all have kind of an equal chance. Now if we did this for the second one, same kind of thing. It looks different here just because of the randomness that happened here but you would expect the same kind of trend line where it would trend towards a uniform distribution. It's not a bell curve or anything like that. It would tend towards a straight line type of distribution. Now the fact that these both are kind of similar in nature may give us some understanding about the data sets but doesn't necessarily mean that they're correlated either. So then we're going to do our mathematical calculations. So we'll say okay let's do the mean and the standard deviation for the first data set which would be the average formula, the average summing up all of the data dividing by the data points gives us 48.82. The standard deviation is going to be the standard deviation of a sample this time of the first data set. That's a measure of the spread 29, 34, 30, 73. I got dyslexic there for a second. If we did that with a mean summing up all of this stuff and taking the average divided by the units we get to the 49, 51 and for the standard deviation if we took all of this with the standard div.s we get 26, 56. Now these could give us some indication where it's like the mean is similar and the standard deviations are similar so we might kind of start to think that they're kind of related in that way and these look kind of together but it doesn't necessarily mean that there's a relation because again there's not like a direct relationship in the way we created the numbers other than we created them in a similar fashion in terms of the random numbers between a certain interval. So let's do our calculations here and say that actually before I do that let's actually graph this thing out. If I took these two and made a scatter plot of them, graphing them together it would look something like this. So now we've got the scatter plot. If I was to do this it would automatically take the first random and make it on the left. I'm sorry make it the X. So the random one is the X. In this case I don't really know which should be the independent or dependent so normally we put the independent factor over here if we know it but we don't really know what it is. There isn't really one here because we made them completely separately we just used in essence the same technique to make the two datasets. So you can see here that if we plot these together we get somewhat of a random jumble of data points. So somewhat of a random jumble and if we plot the curve in there, the trend line we get a little bit of a correlation but downward sloping but not a high correlation obviously you can't really see it if you didn't have the trend line you wouldn't even see a general pattern right here generally with the dots. Now if I was to switch the X and Y's then remember you're still going to get that slight downward sloping it's not going to change to an upward sloping line and so now we just changed the X and the Y's. In this case we can see the same relation and we don't know which is the independent or dependent factor. Now just to get an idea of what randomness looks like versus what often people have in their mind of randomness we did another dataset just to kind of show this and this one we actually kind of put a system together to create our two datasets. What we did is we just took we counted by 5, 5, 10, 15, 20, 25, 30 up to 100 here and then we did the same thing here but we staggered it that's our starting point. I staggered it and we made we said 5, 10, 15, 20 but we started it like down here. So we had the same distances and so we had a pattern and then we made a random number generation on each of these tables and shuffled up the pattern. So now I've got two different sets of numbers that were created with this kind of distancing pattern. I didn't randomly pick between 1 and 100 but rather we basically had units of 5 that we put in place and then made a random number generation to kind of shuffle them up and the reason I want to just take a look at that is because if we then take these two and plot them the numbers that were generated from that process then we get something that looks like this. So you can see this is something that is not exactly as random, right? This has some randomness to it but this is more like something that might happen if you were trying to create randomness yourself. So if someone said put a random set of numbers but you had to think of them in your mind you would probably space out the numbers so they look something like this. So when you see something like this versus something like this this one is often going to be the more random set of numbers because you end up with this clumpiness like when we think of randomness we don't think that these clumpy things are going to happen but with randomness it does happen when we try to do something randomly we tend to eliminate the clumpiness that happens in randomness. So that's just something to keep it to point out here. Now note that when we generated this number whether they be randomly generated or not the two data sets that we created are still not correlated because we used the same process again but we didn't tie them together. So even though it's still not exactly as random as the other set we still have a very low correlation as indicated by the trend line. Okay so now let's do our calculation to see that mathematically so this is our z-score of the one times the z-score of the other divided by the n-1 so if we did that calculation we can take then the this is our first data set with our random numbers we take the z which is going to be each number minus the mean so in this case it would be 19 minus the mean which is going to be in this case minus 48.82 divided by the standard d 29.730 and that's going to be our negative about one and we do that all the way down for all the related items and we do that for the y as well so here's our random data set two and then here's all of our z's for it so that first data point in the data set 27 minus the mean for the second one which is 49.51 divided by the standard d 26.56 gives us 84 so we're going to go back on over say okay that's going to give us about 85 and then we multiply the z's together so that we get about one times 0.85 gives us about 85 of course and then this one's going to be the 1.54 times the 1.89 gives us hold on a sec K-PASO the 1.54 times the 1.22 is going to give us the 1.89 about rounding is involved here so if we do that all the way down then that will give us our sum we just need to sum them up to give us the numerator so we do that over here we sum them up we create a little table the numerator is just going to be the sum of this outer column the denominator I'm going to make a subcategory for it by putting n-1 or you can call it the denominator colon do the sub calculation internally like you might see like a tax return kind of format tabbing to show it's a sub calculation indenting in other words 215 is the number of data points so if we counted all the rows minus 1 and that's going to give us the denominator which I'm putting to the outside double indenting 214 I now have the numerator and the denominator of our formula numerator denominator on the outer columns of our worksheet and we can divide it out 25.43 divided by the 214 is going to give us the 0.11884 about so that's going to be our correlations fairly low correlation and we can check that with our data analysis which is under the data tab you have to turn it on so if you don't have it on you can go to the options and turn it on which we show in the worksheet how to do that when we do this in excel you can then pick the you could pick the correlation this I picked the correlation right there and that would give you an input field and then it would spit out if you pick the correlation this number we're looking at this one here which is comparing the 1 the rand 1 and the rand 2 negative 0.11884 which is what we got up top there so this one is not dynamic though and remember that you don't get all the the benefit of kind of looking at these z-scores as you're generating your worksheet so it's a great tool to have but it's somewhat limited to just kind of spit out the correlation at the same time now if I did the same thing with the second data set that we made the second data set which was less kind of random and I calculated the z-score in a similar fashion and I calculated the z-score for both of them in a similar fashion I multiplied the 2 z-scores together to get our column over here and then we did the same correlation calculation summing up the column to the right the right most column n minus 1 or denominator is going to be the number of items 215 rows they're not all here by the way they might be all here but I don't think I copied all the rows in our example in one note but they would be in Excel minus 1 214 dividing out the numerator and denominator gives us a lower 0.00815 and this is the mean and the standard deviation to help us with the calculations of the z-scores and then if I did that in Excel to double check it Excel's giving us a similar well the same obviously result here so the general idea then would be that when we're looking at the correlation we here shows randomly generated numbers to get an idea of the correlation we can come up to kind of a hypothesis of how they would be connected noting that although we looked at some of the stats that have kind of a lot in common you know the mean and the standard deviation are closed and they're both have a tendency that's going to go towards a uniform type of distribution doesn't necessarily mean that there's going to be the high correlation between them and then we mapped out that difference between randomness here and randomness here which is a good concept to understand but whether or not your data set is more or less random doesn't necessarily mean that it's going to have an impact on whether or not the two data sets are correlated together moving together in some way shape or form.