 statistics and excel correlation baseball statistics get ready taking a deep breath holding it in for ten seconds looking forward to a smooth soothing excel first a word from our sponsor yeah actually we're sponsoring ourselves on this one because apparently the merchandisers they don't want to be seen with us but but that's okay whatever because our merchandise is is better than their stupid stuff anyways like our crunchy numbers is my cardio product line now I'm not saying that subscribing to this channel crunching numbers with us will make you thin fit and healthy or anything however it does seem like it worked for her just saying so yeah subscribe hit the bell thing and buy some merchandise so you can make the world a better place by sharing your accounting instruction exercise routine if you would like a commercial free experience consider subscribing to our website at accounting instruction dot com or accounting instruction dot think of it dot com here we are in excel if you don't have access to this workbook that's okay because we'll basically build this from a blank worksheet but if you do have access there's three tabs down below example practice blank example in essence answer key a practice tab having pre-formatted cells so you can get to the heart of the practice problem the blank tab a blank worksheet so we can practice formatting these cells within excel as we work through the practice problem let's go to the example time to get an idea of what we will be doing where we will be going we're looking at correlation once again when we're thinking about two different data sets to see if there's a mathematical relationship between them are they moving together in some way shape or form in other words if there is a mathematical relationship the next logical question would be is there a cause and effect relationship if there is a cause and effect relationship the next logical question would be what's the causal factor in that relationship this time we're going to be pulling baseball statistics data and we'll compare a couple things from it will trim down our data set to start off with and then we're going to be looking at the age versus the batting average doing our correlations calculations from that data set and then we'll do another one looking at the correlation between the batting average and the RBI so let's go back to the left let's go to the blank tab we're going to pull our data set I'm just going to look up baseball statistics I find baseball statistics at baseball reference has some information for us I'm looking at the batting averages for 2022 and I'm going to see what they have for us down here so they have it by team I don't want it by team I want it by players I'm going to go to the stats by player and then in this drop down they say we can drop it down and pull this in I don't want to pick the excel workbook because it gets limited to 500 lines and we want all the lines so I'm going to make it into a CSV format and it gives our CSV information down below which is a comma deliminated format and then I could just copy all this and I should be able to just copy and paste this into our excel worksheet hopefully without giving us a virus or anything this is should I'm trusting the site here it's gonna it's gonna be just good good data it's just good data right here we're gonna copy this whole thing I'm gonna put this over in our worksheet in a one right click and paste it let's paste it either way is fine I'm gonna hold control scroll down notice it's all in column one here and it's a mess but I want to I want to transfer it from a comma deliminated data to a table so to do that I'm just gonna keep that first column highlighted because everything is in the first column we're gonna go to our data up top and we want to go to the data tools and then text two columns tool text two columns I'm gonna say it's gonna be deliminated delimitator charters such as commas let's keep that and then I want to deliminate it with a comma that's the deliminating factor not not the tabs and then down here you can see it's spacing it out like it looks like it should be so I'm gonna say okay and then finish it finish it and so there we have it now I'm gonna call this our data tab and I'm actually gonna make a separate data tab from the the one that we're going to be working in because we might need to do a little bit more cleaning up of the data from this tab so I'm gonna double click down here and call it the data blank tab and so there we have that and so that looks good the headers maybe I'll go up top and make this the header column by going into the home tab font and insert and if I go all the way to the bottom by hitting control shift down let's do it over here control shift down then this last row they give me a total I don't really want the total down here I don't think I did this last time and I kept the totals in there which wasn't what I wanted to do us delete the totals down here and so we just have the data okay so then I'm gonna go back up top and we're gonna say let's insert a table insert tables and I'm just gonna insert the table here and so there is our information so we've got the name we've got the age and so on and so forth so what I want to think about is is the batting average that's gonna be how many times people get on base in essence or what's the what's the what's the percentage of times that they get on base and let's see if there's a correlation between that and their age over here so let's pick those two up so I'm gonna pick up let's add another tab down here clicking on the plus button I'm gonna double click on the tab and call it a blank tab and then I'll just pull in my data and we'll pull it into this blank tab so I'm gonna go back to our data tab I'm gonna select the entire column of B which has our names the entire column of C which has the ages and then over here with the batting averages over here we're gonna copy that all that copy that and let's go on over to our blank tab and just paste it on down in column a we're just gonna paste it on down I think we could just paste it normal and it'll give us our formatting that's fine I'm gonna hold control and scroll up so there is our data now I'm gonna format the entire worksheet first and then I'll go back into your noting that these are decimals and these have no decimals so I'm gonna format the entire worksheet and then reformat these two columns to an appropriate formatting so I'm gonna select the entire worksheet gonna right click on it and we're gonna say that let's format the cells and I like to make it currency negative numbers bracketed no dollar sign I'm gonna remove the decimals and then add them back as needed removing the decimals okay and then I'll add the decimals back to this column because we need them in here clearly because that's what the batting average is represented as and this data set so home tab number group adding some decimals decimalizing it I'm also gonna make the entire thing bold because we like to work we like to be work in in the bold okay because it's better for the screencast you don't have to do that if you don't want to so then let's go ahead and do some of our our statistical calculations which we know by heart at this point in time with our actual data but before we do let's clean up our data a little bit let's go to the insert tab up top let's go to the tables and add a table I'm gonna put a table around this whole thing now note that when we look at this data the batting average has some blank stuff in it and also if people didn't have that many at bats then you might want to remove the batting averages for those items as well so I might sort this now by the batting average and I'm gonna say a to z and then everything with a zero in it I could probably just zero that out or I'm gonna remove that I don't think that's gonna be useful to our data set and this is where this is where tailoring your data set to try to get the most relevant data is useful or is applicable right we gotta say should it doesn't make sense to keep that in or should we pull it out how should we pull it out should we base it on how many at bats they had and so on I'm just gonna take all of these rows down to all of the zeros for sure remove them point five let's take it let's take it up to like point oh five nine everything above that we will keep it and assume as legitimate data so we're gonna right-click and I'm gonna delete all of that stuff and then if I sort the batting average from z to a if we have these really large batting averages that would probably indicate that they didn't have a lot of that bats possibly because that would be a quite a substantial batting average so I'm gonna say let's just remove at least this top one maybe these two I'm gonna remove these two and and these are somewhat subjective as to whether we should remove those are not right because that you have so that but that's it now so now we got the batting averages which are the heart of the data and I'm gonna say let's pick that up then and use it to our do our statistical analysis on so we've done this before we're gonna we're gonna make a skinny D we could do this with our eyes closed we could do it with our even if some jerk tied our eyelashes to our nose hairs forcing our eyes closed and to be watered at the same time we could still do this at this point so let's let's say that we're gonna pull in the data let's make this age and the batting average I'll make this black and white for the header home tab font group black and white for the headers let's center it alignment and center our normal calculations this is gonna be the mean the standard D and let's pull this in a bit we don't need it to be that large and then I'm just gonna do the average equals the average age of the baseball players control shift down and 28 so not now I'm not gonna be in there unless I'm not crazy phenomenon that could I'm gonna copy that to the right and let's do the batting average just go to the home tab numbers decimalize it let's put three decimals and there's the average batting average point two so we percentize that 22% so remember when we're talking about baseball it's the likelihood of actually getting on base is not that high is the general the general thing 22% is the average so then we're gonna say according to this statistic which we basically you can see how we trimmed the data that we have over here which might be a different way of trimming the data than some other stats that you might take take a look at which might use different techniques but that's what we'll go with now standard deviation equals the standard D we want for the sample this time and I'm gonna pick up the age control shift down as we've seen before enter and there is there's that for let's decimalize it home tab number group decimalize copying it to the right fill handle dragging it to the right and so there we have the standard D here home tab number let's add a couple more decimals on that one all right so then we could say well what does this data look like in terms of a histogram let's let's check that out let's take the ages and say insert chart and make a histogram of this and so there's our age histogram I should have I should have put it let's delete that and then I'm gonna I'm gonna go up to the top and then insert it so I don't have to drag it up chart insert histogram all right so this is our age histogram age histogram and so we've got people you know in it looks like it's kind of you know somewhat bell shaped not you know you got the people at the middle part is the higher point and then it tapers off although you've got this peak at 24 to 25 and then this other peak at 28 to 29 and then you go way out here and I'm past I'm would be an outlier out there but I'd still if I if I wanted to I could do it I could do it right here right now if I just don't want to if I did I would do it but it's stupid because I don't need it I don't need to do that anyway I'm gonna say control shift down on the batting average and notice I have some blank data down here still I thought we removed all of the blank data let's go back up top and let's redo that let's fix that batting average I'm sorting by batting average I deleted the low ones let's go let's go down and delete all of that and do this again control shift down everything below here I'm gonna delete so I'm gonna delete the blanks if they have no batting average we don't even want you here you didn't make the team what are you even doing on the team if you don't have a batting average get out of here let's do it again let's do the mean again equals the average making sure we pick up the right data copy that to the right decimalize it and then the standard D let's do that again equals the standard D go back to the miners man go back to the miners oh that's mean whatever it's how it is when you're in the big leagues I can't hack it home tab we're gonna say okay so so then let's do our our our age again let's do this age again thing now that we have our this data and insert charts histogram so there we have it alright so this is our age okay and then let's do one for the batting average control shift down control backspace and insert charts histogram boom oh no I did that again let's pull it down and this is gonna be the batting average BA and so again it looks kind of like it's bell shape it's tapering off we've got these outliers that maybe we should have trimmed off over here and that middle point being being around the mean so so that doesn't necessarily give us an indication that there's any any kind of correlation but it might give us some ideas sometimes about our hypotheses about the data so now let's let's do our let's take this and take our formula for the correlation calculation I'm gonna make it a large H and we'll say this is going to be smaller here oh now I'm moving it that's not what I wanted to do make a smaller one here and then let's make a smaller one here okay and then we'll pull this into the large H alright so it's still probably quite large I don't think I need anyways that's okay so then let's take our data and do our calculation for the age and the batting average I'm just going to take those two and paste that over here and so there we have it let's make our header tab this way go into the home tab font group header black white center it and then let's insert column J let's insert something between column J right-click and insert and then we're going to say this is going to be the Z of age Z of age and so we'll do our Z score this is going to be equal to brackets the 24 minus the 28 I'm going to say F4 on the keyboard and then close that thing up and then divide it by the standard D F4 on the keyboard dollar sign before the letter and number and enter then decimalizing it home tab number group decimalize double click in the fill handle to drop it down let's do it again over here for the Z of the batting average format paint the header cell home tab clipboard format painter for the Z of the batting average doing the Z score once again equaling the batting average data with what we need a bracket bracket and then the batting average data minus the mean F4 in the keyboard making an absolute dollar sign before the letter and number closing up the brackets and dividing by the standard D which is the point 0 5 5 5 and then we want an F4 on the keyboard dollar sign before the G and the 3 and enter let's decimalize it home tab number group decimalize to recognize and then double click on the fill handle to drop it on down let's make it a little skinnier from I to L because it's ill needs to be sick this is sick man it's ill all right so we're gonna close it up like that is that good or bad I don't know it's ill man it's ill all right let's do then the next one is gonna be the Z age times times the Z of the BA and let's do this formatting home tab font group black white let's wrap the text this time and center it so now we're multiplying these together so we took the the X minus the mean over the standard D the Y minus the mean over the standard D which is the age and the batting average now we're multiplying them together this is going to be this the Z of the age times the Z of the BA or batting average let's decimalize to recognize and then double click in the fill handle to drop it down dropping it down all right and then we'll do so now we've got we just need to sum it up to get that numerator so we'll close up let's make the end a little bit thin and then we're gonna say this is going to be the R or core correlation I think I spelled it right this time this is the last problem I believe and the spelling home tab font group has been done I can learn I learned stuff some of the Z of age times the Z of the BA tab tab putting it in the outer column summing it up equals the SUM shift 9 and picking up our last column control shift down and enter let's decimalize it to recognize it a little bit better home tab number if you want to recognize it better you need some of those you need to reveal the decimals if otherwise you don't even know who they are I don't even know who you are some column then we're gonna do the denominator n minus 1 subcategory calculation we're gonna say n just n just n don't try to help me out Excel less one you're not helping you're not helping Excel I don't care if you're trying to help trying to help I'm gonna say let's say let's select these two my house can't be rebuilt with good intentions Excel you've you're messing it up home tab alignment space this is going to be the count so we're gonna count the number of rows we have equals the count I can count any one of these let's just pick the first one control shift down and enter so we have 815 columns of data less one according to the numerator in our form you lie equals 1 815 minus the 1 814 let's put an underline under here home tab font group underline and let's say this is going to be the R or correlation so the correlation is going to be equal to the numerator up top now in the outer column divided by the denominator it's zero let's go to the home tab number and decimalize to recognize so point oh five so not the largest of correlations there we could also plot this out we could plot this thing out and say all right what if I plot it out then check it out that way let's do this to data sets control shift down and say that we're going to say control backspace and insert charts scatter plot and check it out that way so now you've got let's put the labels on it plus button access labels it's going to pick up by default the one on the left being the X so I'm going to say this equals the age and then this one over here is going to be equal to the BA or batting average so you know usually we would be thinking this would be the independent variable so you can question if that is the case as the age goes up what you would kind of think possibly the batting averages would go down except maybe as the age goes up they're able to consume more more drugs or something and so maybe they get more yoked up as they get oh I don't know I don't know I'm just guessing here but let's just see what the data says in terms of the correlation line the trend line let's put that one in place do you don't see much action happening they're not a very high degree of correlation at all if we tied it if we could know how much how much juice they were taken that would probably give us a better line I don't know I'm just kidding let's go to the first tab we're gonna say let's hit the let's hit the drop down here make it a straight line let's make it orange and so we have a bit of a bit of you know a negative trend but nothing that's looking absolutely telling now I'm gonna pull this to the right and say well what if obviously if you were to to make assumptions about this data you'd probably be saying well if the age goes up their batting average would go down that would be the hypotheses that you would most likely be making making the age the dependent or the independent factor and the batting average dependent it would look fun so but you could think about it the other way let's reverse this chart just to just to do it so I'm gonna go back to this tab and select the same data control shift down and control backspace and then insert charts scatter plot and this time I want to change the axis so let's add an axis axis title but this time I want to make the x equal to the batting average as if it's the driving factor that is going to be correlating to the age so and so sometimes you don't know which is which and you could plot it either way you're still going to get the same kind of correlation direction negative correlation in this case but I need to flip the data around then to do that so I'm gonna go to the charts I'm gonna go to the select data up top pick our data here and edit it and then this x factor I'm gonna delete it needs to be I want to make that the BA data control shift down control backspace and there it is make sure and then on the why I'm gonna delete this data and make this equal to the why data control shift down control backspace and boom so there we have that one and okay okay so there's that scatter scatter if I add the trend line in that one I still get that slightly downward sloping trend line if I then take a look at the options and and format it we're gonna go to the bucket up top make it a straight line and orange for the buckets or orange make it orange that's the format that we like to see the line in alright okay so then we've got that and so then I can let let's let Excel do the correlation calculation and see if it gets to the same point oh five that we got to here so let's try that so let's go let's go to the end to the data tab to our analysis if you don't have that analysis it's in the file tab it's in the options it's in the the add-ins and then it's in the add-ins down here and then you go to that and then you click off the tool pack if you want to be part of the cool cats alright so then we're gonna go up to the data analysis and we're gonna say this is gonna be the correlation of okay let's go into it first hitting the tab which where's the data set we want to go to we want this data set the age and the batting average control shift down control backspace going back up top we picked up the row for the column so we want the labels on the first row where do we want to put it on our worksheet we want to put it right there and so we'll say okay and so there it is it looks like we've coming up with the same correlation let's add some decimals to better recognize and so there we come up with the same correlation movie B to the end let's make a skinny let's make this black and white home tab font group black white there there that is now we can also do our our standard statistics on this one just to practice that for the fun of it with our data data tab analysis data analysis we want the descriptive statistics and then we'll pick up the data inputs here control shift down control backspace we want to pick up the labels output range is over here we want to put it over here somewhere right there and then I'm gonna get the summary statistics and possibly the confidence at 95 and ok so that spits out our general data here it's not dynamic it's hard-coded in other words but there it is let's format this I'm gonna format this and say let's go to the home tab black white I'm gonna make a skinny V I'm gonna make all this blue and bordered and the next time we'll pick some other stuff and see if we can get a higher correlation so I'm gonna go to the home tab font group bucket drop-down not black but that blue if you don't have that blue I use standard blue right there and border it as well border blue and then I'll make this border blue and then I'll make this border blue and then I'll make this whole thing control shift down border blue and then I'll make this border blue and then I'll make even this border blue control shift down border blue we'll make the same skinny columns too but I'll wait to do that next time as we pick some other data sets again and try to see if we can find a higher correlation between them