Statistics and Excel: Correlation Random Number Generation Example. Got data? Let's get stuck into it with Statistics and Excel. You're not required to, but if you have access to OneNote, we're in the icon on the left-hand side: OneNote Presentation 1730, Correlation Random Number Generation Example tab. We're also uploading transcripts to OneNote so that you can go to the View tab, Immersive Reader tool, change the language if you so choose, and then either listen to or read the transcript in multiple languages, using the timestamps to tie in to the video presentations. We're in the OneNote desktop version here, thinking about correlation: having different datasets and seeing whether or not there's a mathematical relation or correlation between them. In other words, are the data points in the different datasets moving together in some way, shape, or form? If there is a mathematical relation or correlation between the different datasets, the next logical question would be: is there a cause-and-effect relationship which is causing the correlation or mathematical relation between the different datasets? And if there is a causal relation between the different datasets, the next logical question would be: what's the causal factor in the causal relationship which is causing the correlation or mathematical relation between the different datasets? Now, in prior presentations, we thought about a perfect positive and a perfect negative correlation, which are great to consider in theory but are not usually what we have in practice, where we don't have a perfect correlation; we usually have some kind of trend. We saw that in another example where we had a very simple trend of only four data points, so we could really analyze the formula that we're putting together in a simple, small, low-data-point example.
This time we'll have more data points, but we'll actually generate the data points ourselves to get a conceptual understanding of how they're generated and what we might therefore assume would happen with the correlation, and then we'll map that out. We'll also take a look at what it means to have randomness to some degree and what randomness kind of looks like as we go through here. So we're just going to imagine we generate our data in Excel: random data one with this formula, RANDBETWEEN, and then we're just picking the bottom number and the top number, one to a hundred. Excel then generates random numbers between one and a hundred. We do that two times over, so we create a second dataset in the same fashion, and that's going to be the data that we will be using. When we do this in Excel, Excel will keep on regenerating these random numbers. So we're going to imagine that we copy this information and paste it over here, so now we have static, randomly generated numbers which will not keep shuffling around. Now if I was to just think about how we created those numbers, we can make some assumptions based on what we've done in the past. Well, first, I randomly generated numbers not in accordance with a normal distribution or a Poisson distribution or anything like that, but more like a uniform distribution, because any number in the interval from one to a hundred was equally likely to come up. We also know that we generated these two datasets the same way, so they're kind of related in that way: they're both going to be numbers between one and a hundred. But they're not connected in any other way in terms of how we created the two datasets, so we might have a hypothesis that they wouldn't be highly correlated, because they're not really connected. So we'll kind of test those out as we go. Now first I want to look at it pictorially.
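Just as a rough sketch for anyone following along outside of Excel, the generation step described above could look something like this in Python. Python isn't used in the presentation; the count of 50 data points, the seed 1730, and the names `make_dataset`, `random_1`, and `random_2` are all assumptions for illustration. The fixed seed plays the role of pasting the numbers so they stop reshuffling.

```python
import random


def make_dataset(n, low=1, high=100, rng=random):
    """Return n integers drawn uniformly from low..high inclusive,
    similar in spirit to Excel's RANDBETWEEN(low, high)."""
    return [rng.randint(low, high) for _ in range(n)]


# Assumption: a fixed seed stands in for "copy and paste" so the
# numbers stay static instead of regenerating on every run.
rng = random.Random(1730)

# Assumption: 50 points per dataset; the presentation does not state the count.
random_1 = make_dataset(50, rng=rng)
random_2 = make_dataset(50, rng=rng)

print(random_1[:5])
print(random_2[:5])
```

The two lists come from the same generator in the same way, but nothing ties the value at position i in one list to the value at position i in the other, which is the reason for expecting a low correlation.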
Let's say we took just this first dataset and we made a histogram of it. We counted all of the numbers here to see how many fall into the buckets of one to eighteen, eighteen to thirty-five, and so on and so forth, and you can see that it's not a bell curve type of shape or anything like that. If we did this indefinitely, it would tend towards a uniform distribution, a flat, straight-line type of distribution. We would expect kind of an even outcome because every number has an equal chance. Now if we did this for the second one, same kind of thing. It looks different here just because of the randomness, but you would expect the same kind of trend, where it would tend towards a uniform distribution. It's not a bell curve or anything like that. It would tend towards a straight-line type of distribution. Now the fact that these both are kind of similar in nature may give us some understanding about the datasets, but it doesn't necessarily mean that they're correlated. So then we're going to do our mathematical calculations. We'll say, okay, let's do the mean and the standard deviation for the first dataset. The mean would be the AVERAGE formula: summing up all of the data and dividing by the number of data points gives us 48.82. The standard deviation is going to be the standard deviation of a sample this time, for the first dataset; that's a measure of the spread: 29.34, 30.73. I got dyslexic there for a second. If we did that for the second dataset, the mean, summing up all of this stuff and dividing by the number of units, gets us to 49.51, and for the standard deviation, if we took all of this with STDEV.S, we get 26.56. Now these could give us some indication, where it's like the mean is similar and the standard deviations are similar.
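To make the bucket counting and the mean and standard deviation concrete, here's a minimal sketch in Python rather than Excel. The data is freshly generated with an assumed seed and an assumed 50 points, so the numbers will not match the 48.82 and 49.51 from the worksheet; the bucket width of 17 is also an assumption that only loosely mirrors the one-to-eighteen, eighteen-to-thirty-five buckets. `statistics.mean` corresponds to AVERAGE and `statistics.stdev` to STDEV.S (the sample version).

```python
import math
import random
import statistics
from collections import Counter

# Assumptions: seed and sample size are illustrative, not from the worksheet.
rng = random.Random(1730)
data = [rng.randint(1, 100) for _ in range(50)]

# Text histogram: count how many values land in each bucket.
width = 17  # assumed bucket width
counts = Counter((x - 1) // width for x in data)
for b in range(math.ceil(100 / width)):
    low = 1 + b * width
    high = min(low + width - 1, 100)
    print(f"{low:>3} to {high:>3}: {'#' * counts[b]}")

sample_mean = statistics.mean(data)   # like Excel AVERAGE
sample_sd = statistics.stdev(data)    # like Excel STDEV.S (divides by n - 1)

# For whole numbers 1..100, each equally likely, the long-run values are:
theory_mean = (1 + 100) / 2                 # 50.5
theory_sd = math.sqrt((100**2 - 1) / 12)    # about 28.87

print(f"sample mean {sample_mean:.2f} vs long-run {theory_mean:.2f}")
print(f"sample sd   {sample_sd:.2f} vs long-run {theory_sd:.2f}")
```

Any one sample will bounce around those long-run values, which is why two independently generated datasets can show similar means and standard deviations without being correlated.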
So we might kind of start to think, well, maybe they're kind of related in that way and these will kind of move together, but it doesn't necessarily mean that there's a relation, because again there's no direct relationship in the way we created the numbers, other than that we created them in a similar fashion, as random numbers between a certain interval. So let's do our calculations here. Actually, before I do that, let's graph this thing out. If I took these two and made a scatter plot of them, graphing them together, it would look something like this. So now we've got the scatter plot. If I was to do this, it would automatically take the first random dataset and make it the X. So random one is the X. In this case I don't really know which should be the independent or the dependent. Normally we put the independent factor on the X if we know it, but we don't really know what it is. There isn't really one here, because we made them completely separately. We just used, in essence, the same technique to make the two datasets. So you can see here that if we plot these together we get somewhat of a random jumble of data points. And if we add the trend line, we get a little bit of a correlation, downward sloping, but not a high correlation; obviously you can't really see it. If you didn't have the trend line, you wouldn't even see a general pattern here with the dots. Now if I was to switch the X's and Y's, then remember you're still going to get that slight downward slope; it's not going to change to an upward-sloping line. So now we've just switched the X's and the Y's. In this case we can see the same relation, and we don't know which is the independent or dependent factor.
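The point about switching the X's and Y's can be checked with a small sketch, again in Python and again on freshly generated data with an assumed seed and size, so the sign and size of the correlation here need not match the slightly downward-sloping one on the worksheet. The helper names `pearson_r` and `trend_slope` are made up for this illustration; they follow the standard Pearson correlation and least-squares slope formulas.

```python
import math
import random


def _sums(xs, ys):
    """Return the sums of cross products and squared deviations from the means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy, sxx, syy


def pearson_r(xs, ys):
    """Pearson correlation coefficient between xs and ys."""
    sxy, sxx, syy = _sums(xs, ys)
    return sxy / math.sqrt(sxx * syy)


def trend_slope(xs, ys):
    """Slope of the least-squares trend line of ys on xs."""
    sxy, sxx, _ = _sums(xs, ys)
    return sxy / sxx


# Assumptions: seed and sample size are illustrative only.
rng = random.Random(1730)
random_1 = [rng.randint(1, 100) for _ in range(50)]
random_2 = [rng.randint(1, 100) for _ in range(50)]

r_xy = pearson_r(random_1, random_2)
r_yx = pearson_r(random_2, random_1)
slope_xy = trend_slope(random_1, random_2)  # random_1 on the X axis
slope_yx = trend_slope(random_2, random_1)  # axes switched

print(f"r = {r_xy:.4f} either way (switched: {r_yx:.4f})")
print(f"slope with random_1 as X: {slope_xy:.4f}")
print(f"slope with random_2 as X: {slope_yx:.4f}")
```

Both slopes share the same numerator as the correlation, so they always carry the same sign as it: switching the axes changes how steep the trend line is, but not whether it slopes up or down.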
Now, just to get an idea of what randomness looks like versus what people often have in their mind as randomness, we did another dataset just to kind of show this, and for this one we actually put a system together to create our two datasets. What we did is we counted by fives: five, ten, fifteen, twenty, twenty-five, thirty, up to a hundred here. And then we did the same thing here, but we staggered it; that's our starting point. I staggered it and we said five, ten, fifteen, twenty.