 statistics and excel correlation baseball statistics got data let's get stuck into it with statistics and excel first a word from our sponsor yeah actually we're sponsoring ourselves on this one because apparently the merchandisers they don't want to be seen with us but but that's okay whatever because our merchandise is is better than their stupid stuff anyways like our accounting rocks product line if you're not crunching cords using excel you're doing it wrong a must-have product because the fact as everyone knows of accounting being one of the highest forms of artistic expression means accountants have a requirement the obligation a duty to share the tools necessary to properly channel the creative muse and the muse she rarely speaks more clearly than through the beautiful symmetry of spreadsheets so get the shirt because the creative muse she could use a new pair of shoes if you would like a commercial free experience consider subscribing to our website at accountinginstruction.com or accountinginstruction.thinkific.com you're not required to but if you have access to one note we're in the icon left hand side one note presentation 1670 correlation baseball statistics tab we're also uploading transcripts to one note so that you can go to the view tab immersive reader tool change the language if you so choose and be able to either read or listen to the transcripts in multiple different languages using the timestamps to tie into the video presentations one note desktop version here thinking about correlation where we have different data sets to see if there's a mathematical relation or correlation between them in other words are the dots and the different data sets moving together in some way shape or form and if there is a mathematical relation or correlation between the different data sets the next logical question would be is there a cause and effect relationship causing the mathematical relationship or correlation and if there is a cause and effect relationship the next logical question would be what's the causal factor that's causing the causal relationship which is causing the correlation or mathematical relation between the different dots and the different data sets we're not going to be looking at a baseball statistics we're going to be pulling our stats or imagine they pulled from when we did this in excel this baseball reference website we're not advertising for them but we're just getting our data there so we can imagine going to the website as we did when we work this in excel and you can check this out in excel if you so choose and we have the option of downloading the excel but it's limited so we were able to transpose it to a csv file which is comma deliminated and then simply copy this entire thing paste it into excel and change the formatting from a comma deliminated formatting to a table which is a common theme oftentimes because oftentimes data sets might be in a csv or comma deliminated file when we pulled it in we get something that looks like this we have of course baseball stats now note that baseball stats are similar to job stats so baseball is great because you have a whole lot of stats in baseball due to the nature of the game but it's a job for them and many of the concepts we apply when analyzing different baseball players can be applied to different jobs as well we're going to try to break down what are the essence of the jobs what can we measure in the job how can we use ratios and we can apply some of these statistical analysis to judge performance of one person to another and of course compensation you would think should be based on performance based on these kind of this type of analysis uh so we've got the age we've got all of these stats up top we're going to be focusing in on those pick like the age and the batting average so the batting average over here represents how many times someone gets on base so it's kind of a versus how many times they were at bat so if they're if they're at bat they're hitting if you're not not familiar with baseball they're trying to hit they're trying to get on base whatever what whether that be first second third or a home run and and they have a lot more likelihood of getting out because there's many different ways that they could get out with a pop fly striking out grounding out being thrown out and so on and then we're going to compare that to the age so as they get older the hypothesis might be our older players is it correlated that older players are going to have a lower batting average right that would be the hypothesis so we'll check that out and so i'm going to take my data here and we're going to focus in on just those those components so we pulled in the name and we pulled in the age of the players and then we pulled in the batting average and then we're able to sort them sort them here by these by this table now note when we look at the batting average the next common kind of issue that comes up is uh there might be some batting averages that shouldn't be in our data set possibly because maybe they didn't have that many at bats so maybe they had one at bat and they got a hit their batting average would be a hundred percent then one out of one however that's not a useful stat uh generally because it's because it's going to skew the data and really they didn't have enough at bats to really have a judgment so we could we trimmed down the data set here so that we're picking we're trimming out those that were very high and very low because the likeliness would be that they didn't have a lot of at bats and therefore they have outliers on the batting averages so keep that in mind when you're looking at large or a lot of data uh that you want to think about how can I adjust the data so that I can get to the heart of the meaning of what I am looking for now if I was to plot this out on a histogram if we just select all of the age and put it on a histogram it looks like this so it's kind of a chaotic histogram it's not exactly bell shaped it's it's a you know kind you got the I guess the middle point here but we have another peak at 30 to 31 and it's kind of interesting in baseball because experience might matter in baseball more than in other sports in other words sometimes when people temper down and actually get less uh edgy they might actually do kind of better sometimes because of the of the nature of baseball you need a lot of patience you know with baseball but then we've got the batting averages which looks kind of more like of a bell shaped curve which is somewhat what we would expect with the batting averages so we just selected this whole thing and entered a histogram and notice batting averages are kind of like performance it would be kind of like performance on a test and you would think that if you if you are picking people that were all the best based on their performance of being the best that if you took like the test scores of it meaning judge their performance you'd get something that looks somewhat like a bell curve if you didn't get a bell curve in that case you might think that something funny is going on because you would think that they're all similarly uh similarly good in terms of their performance because they're all at the same level they're professionals so you would think that performance would would mirror some kind of bell curve possibly so if it wasn't a bell curve you might that might lead to interesting questions in and of itself now if i take the mean of the data uh hold on i'm not if i take the mean of the data we just select all of this data and take the average adding up all the ages and divided by the number we get to 28 if we take the batting average mean adding up all the batting average and divided by the number of rows we get to the 0.22 or 22 percent in other words notice that's under 50 percent well under of course because like i say it's it's a lot more likely that when someone is batting that they're going to get out that they're then they're going to get on base at some and some way then the standard deviation of the sample for the ages this is the measure of the spread 3.66 and for the batting average 0.0555 so then we have our our uh correlation calculation so let's just do this in terms of the mathematical calculation for the correlation we take all of our ages and we're going to come take the z score so the z score we've seen multiple times i'll just uh calculate that it's going to be the 24 minus the 28 divided by the standard d 3.66 and that gives us 1.08 about there's rounding uh involved here let's just do another one and so we get the 25 minus the 28 divided by the 3.66 we get the 81 about and then we can do that all the way down we can do that for the second factor which is the batting average where we can take then the 0.5 in this case minus the mean 0.2216 divided by the standard d for the sample 0.0555 we get about five we can do that all the way down uh uh hold on we get about five there it is and then we could do that all the way down and then we can multiply out the two z scores so now we're going to multiply out the 1.08 times the 5.5.01 that gives us the 5.43 about so that gives us these two bits which we can now sum up and if we sum it up that will give us our numerator so if we sum all that up that gives us the 41 31 the numerator the denominator is the number of items minus one so we'll do a sub-calculation for that n minus one n the number of rows minus one is going to give us 814 then we can divide out the numerator and the denominator now in the outer column being 41.31 divided by 814 gives us about uh 0.5 07 and so on all right so it's not you know highly correlated here the age and the batting average when so we can kind of start to think about that and say well maybe that's the case because like I say if someone is in the league longer then maybe their batting average they get better possibly at batting and they've done more they've done more steroids but that's just kidding so maybe they bulked up in that time or something but no so so but you the performance could go up in baseball but if I took if I take my data tab over here and we test that out by going to my data analysis which you can turn on in excel by going to the options and turn on the analysis toolkit and go to the data analysis looks like this we can check out the correlation enter the data which would just be the two columns of data which have to be side by side so you couldn't take this just from the original data set we had to put them side by side if we're going to use this analysis tool for the correlation and that would then give us the age versus the batting average there's that negative 0.05075 so there's our double check that we did on our numbers if I was to then plot it the age on the x the batting average on the y you can see that the trend line is pretty again is doesn't show a high correlation right the dots are not highly correlated now when my hypothesis would have been that as they get older maybe their batting average would go down that's why I put the age over here on the x as the independent factor which would be driving the batting average right but if I flip these two around I can say I can say well what if I put the the age over here on the y you still get that negative correlation but now that but now it's just flipped the other flipped the other way around so it's just a matter of thinking of trying to get an idea of which you think is the causal factor and usually we try to put the independent factor or the thing that's causing the other factor on the x but that could just be a hypothesis here there doesn't seem to be a significant causal factor between just these now we can also go to our data analysis and allow it to give us our descriptive statistics just to show that in excel so we're going to give the statistics for the summary data and the confidence level and that spits out this which gives us a nice summary it gives us our mean our standard error the median the mode the standard deviation the ketosis sample various units range minimum maximum sum and count now these are not dynamic they don't move as we as we change the data or anything like that but uh they're a great tool to first start putting something together a model together and then you might do these with an actual formula just to double check or to kind of double check your numbers but this kind of analysis of the two data sets age and the batting average might give us some insights sometimes which might give us some more hypotheses and whatnot to see what we want to do going forward in terms of thinking about correlations and how these data sets might be related now if that didn't work we can say okay let's pick some other data so we can go back to our data sets and say well let's try the batting average and the rbi now the batting average is how many times they get on base and the rbis are the runs that were batted in so meaning the hit that they got drove a run in which actually scored a run or a point right of run so so that means so if that's the case then then then i would think my hypothesis would then be that a higher batting average would be kind of the causal factor and i would think that their rbi's would go up right because if they got more hits you would think the rbi's would go up now this one's a little bit careful a little bit a little bit weird however because you'll note that the batting average is in terms of a percent a ratio meaning how many times did they get on base versus how many times did they have the opportunity to get on base here represented with a decimal but could be represented with a percent whereas the rbi is just an absolute number meaning how many how many rbi's do they have so that's a that's a you got to keep the keep that in the back of the mind and say well that's a little bit kind of weird because obviously if someone had more at bats even if their batting average was lower they got on base less percentage of the time they might have more rbi's because they they were they had more opportunities to hit someone in right so some so you might think that maybe the rbi's we should do the rbi's as a percentage of at bats or something like that where we're comparing percents to percents so just something to kind of keep in mind but we'll keep these data and see what happens with it so notice the rbi's if i plot this i get an interesting histogram so now i'm just plotting the rbi's and most of them are the zero to nine point eight and then and then tapering down from that point and so and then we have the batting averages which we saw before which are going to be as we would think more kind of bell shaped because this is the average of kind of performance in terms of getting on base now again if we went to the rbi's up top and we took the rbi's like as a percentage of at bats and we did the same kind of thing and that we did with the batting average trimming off the really high numbers and the low numbers we we might in that case get something that's more bell curve but we'll continue with this let's do our calculation over here so here's our mean of the batting average and the rbi's meaning the average of these two numbers and the standard deviation or spread calculation for those two numbers then we'll do our mathematical calculation same kind of thing we did before yet we have the batting average we've got the z score same concept with the z score now the z score might not be exact the same because we might have trimmed off a little bit more or less in terms of the of the very high ones and the very low ones but the calculation of the z for the data that we have here is going to be the same kind of concept 0.311 each x minus the mean 0.221 divided by the z 0.06 and that's going to give us our 1.5 about we could do that all the way down we do the same thing for the rbi's which are absolute numbers here so now we're going to say all right the rbi's are going to be the 131 minus the mean on the rbi's 26.575 and then take that and divided by divided by the standard deviation 26.337 gives us about 3.95 about we could do that all the way down all the way down we can multiply out the z's and c's what that does 1.518 times the 3.965 gives us the 6.018 okay we can do that all the way down and then if we sum out that outer column that gives us our numerator so we sum that up that gives us our numerator 352 89 the denominator is going to be in minus one so we can then take the number of data points in this case 823 minus our one that gives us 822 and then we have the numerator and denominator in the outer columns so I can then say all right this is 352.898 divided by 822 and we get 0.429 so that's more highly correlated I can do the same thing with my data analysis tool to double check it and that looks good and then if I plot this I can say okay now we've got something there's our regression line our trend line and it looks obviously more positively correlated here so we're going to say all right yeah the batting average as the batting average goes up then the rbi's goes up that's kind of what we would expect we have here the batting average on the x and that's usually what we would do if we think that is kind of like the independent and then the dependent over here the rbi's but we could reverse it and it would look like this so now you've got it's still positively a positively sloping line but now we've got the rbi's and actually if you think about like the rbi's would be a would be a decent way to then think about what the batting average would be right according to this right if we knew the rbi's we might be able to predict you know what the batting average is now notice there's going to be of course outliers on these so like this one over here we had a batting average of higher than 0.6 between points you know 0.65 or something that's a very high batting average what would that indicate that would if they're that high of a batting average it would indicate that they didn't get on base i mean i'm sorry it would indicate that they didn't have that many at bats they probably had a few at bats which means the ratio isn't really a valid ratio possibly we could we might have been better off to delete that one because they probably didn't have that many at bats which means they didn't get and you can kind of see that from here because they if they had a really high batting average you would think they would have batted somebody in right and they didn't now maybe they were the leadoff hitter or something like that so they don't have any opportunity but still as the lineup turns over you would think that they would have opportunities to bat someone in typically so something seems kind of funny with that if you flip it over here you get the same thing but now here's that here's the dot up here at the 6.5 and now the rbi's are down below