 statistics and excel bell curve batting average data comparison example got data let's get stuck into it with statistics and excel first a word from our sponsor yeah actually we're sponsoring ourselves on this one because apparently the merchandisers they don't want to be seen with us but but that's okay whatever because our merchandise is is better than their stupid stuff anyways like our trust me i'm an accountant product line yeah it's paramount that you let people know that you're an accountant because apparently we're among the only ones equipped with the number crunching skills to answer society's current deep complex and nuanced questions if you would like a commercial free experience consider subscribing to our website at accounting instruction dot com or accounting instruction dot think of it dot com you're not required to but if you have access to one note we're in the icon left hand side one note presentation 1638 bell curve batting average comparison tab also uploading transcripts to one note so you can go to the view tab immersive reader tool change the language if you so choose be able to either read or listen to the transcript and multiple languages tying into the video presentations with the timestamps one note desktop version here in prior presentations we've been thinking about how we can sort represent present different data sets using both mathematical calculations like the average mean median core tiles and with pictorial representations like the box and whiskers and the histogram remembering that the histogram is the primary tool we use to visualize the spread of the data using descriptive terminology to describe the spread of the data on a histogram such as it's skewed to the left it's skewed to the right we then want to think about different formulas that can represent lines or curves that can possibly approximate different data sets that happen in the real world if we can do so that would be very good because the formula can give us more predictive power about whatever the data set is about and we looked at different data sets such as or curves or lines or formulas such as uniform distribution binomial distribution poisson distribution exponential distribution continuing on now with the famous one of them all the normal distribution or bell curve we're now going to be applying it to the batting average scenario what's going to be different from this presentation than some that we have seen in the past is one we're talking about something that's kind of similar to a job setting type of situation oftentimes any type of job that you are in when you're measuring performance you're going to have to use some type of statistical analysis and when you're looking at sports obviously we're basically judging the performance of the players and their particular job you can apply the same concepts to other jobs as well also we're going to be doing some comparative data so now we're going to be looking at batting averages on two different years and think about how we might use this information to do some comparisons between say a player that was playing at a different time frame than the current time frame using against statistical tools that you might find in a job related settings such as having to take the percentage in the first point and the z scores then could be relevant as another tool to help us with the comparison we're also thinking about how we might take this information from an online resource the baseball is a great thing to play with because there's a lot of stats related to the baseball they've been tracking the stats for some time and you can find a lot of that information online if you wanted to pick up the information so we're going to get most of our stats we've got them from here baseball reference so this is one area we're not promoting this or anything promoting the website or anything but just to note that when you look up data online oftentimes as we've seen in prior presentations you can download it maybe as a csv file which is quite common formatting that's a comma deliminated file so in excel when we do this in excel and i highly recommend checking that out because we do do it in excel we can copy all of this information in a comma deliminated file paste it into excel and then transpose it into a table type of format so that's one way we can get the data once we get the data this is approximating some of the data from 1920 and some of the data from night uh 2022 now if you were to put this into excel for example you probably want a whole different uh tab related probably to 2020 and 2022 data and then have your actual calculations maybe on another tab or pull the information that you really want either by copying and pasting or by using a pivot table possibly to pull the information into another tab because you might be sorting this data to pick out the data that you want or filtering the data and when you filter the data then you're going to be losing some columns so if you're working on the right side of an excel sheet then you're going to lose some columns and it can kind of mess up your calculations in that case so what we're looking at we have all this data which is great for baseball we're looking at the batting averages here so we're going to be picking up the batting average for 1920 and the batting averages for 2022 now we're just just practicing with the data in other words i'm not giving any guarantees about the results that we come up with here we're just practicing the statistical techniques with the data that we have found for baseball statistics year 1920 versus the year 2022 focusing in on the batting averages if you're not familiar with baseball the batting average is is having to do with the hitter the hitter is coming to the plate trying to hit the ball and they want to get on base as opposed to getting out we're not concerned with which base they get on do they have a single double triple a home run we're just concerned about whether they get on base as opposed to getting out now because there's a high chance of getting out either by being struck out or by being thrown out at first or having a pop fly or something like that the likelihood that they get on base is usually going to be less than a 50% right less than 50% chance that they're going to get on base now you can compare this of course to similar kind of scenarios in a business type of setting if you're talking about sales calls or something like that you might be saying well how many times did they actually get a sale versus how many calls are they making if you're trying to judge the performance of a particular sales person note when you're looking at sports statistics there's often going to be a nice correlation to similar areas in other job related areas because baseball is basically a job for you know the baseball players so we're looking at the batting averages now when you have all of this data for the batting averages we might want to then sort this information of the batting averages high to low and low to high and then you have the question of do we want to just keep all of the data or do we want to trim out some of the data so for example we might have some information in our data set with hat which has nothing in it maybe they didn't have any at bats we had players that didn't have any at bats and it's just blank there do we want to keep the blank cell in there or do we want to remove the blank cell most likely we would want we would want to remove it you would think to have more accurate information so it doesn't skew our data what if there's a zero in there so if there's a zero then the question is well maybe they didn't have that many at bats and the question is well should we keep the zero in there in there or will that skew our data set we can also have maybe a hundred percent what if they got a hundred percent well if they got a hundred percent that's quite unlikely either they're an exceptional hitter or maybe they only had one at bat and they happen to get on base and and therefore it's not quite accurate due to the fact that you didn't have as many at bats so you have the question of should we trim off the ones that are over a certain amount that don't seem likely and again this would be similar to other jobs like a sales call situation if they had the likelihood of getting a sale is is like 10% for most people and one person has a hundred percent on their on their success rate well how many calls did they make maybe they only made like one call right and they happen to get a success so now they have a hundred percent that's not going to be an accurate in your statistics is going to kind of throw it off so possibly in this case as with the sales calls you might say I'm only going to count it if you had so many at bats or above and then I'll pick up the number otherwise I can get data that's kind of skewed so in our case I trimmed off the zeros and I trimmed off some of the high points which might not be exactly how you might want to do it if you got really technical about it you might look at the at bats for example or use other methods but it might be common for us to take our raw data and do some type of analysis so that we can we can take the the the numbers that aren't representative out of the sample in some way that's the general idea now they're not then I took our information for the two years this is the 1920 information this is the batting that's represented in a percentage now if you're talking about batting averages you're trying to see how how often they got on base versus the total times that they had at bat it's a ratio or percent which can be represented as a decimal or as a percent typically uh so but when we do our analysis for a normal bell curve distribution sometimes it's useful to represent it as a whole number at same thing we did with our test score example if you if you watch that one test scores for grades are actually are usually in the format of percentages or decimals but if we can also say I got a 70 I got a 70 right which which means 70 percent or 0.7 of the total but we're going to represent it at 70 so I'm going to take all of our information multiply it times 100 so that when I do the norm dot disc calculation the percentages will add up add up so that's going to be our adjustment we did that for the batting averages of 1920 and then I did the same thing for uh 2022 so here's our data now represented in whole numbers representing then this this number represents that this person got on base 75 percent of the time 66 percent of the time this is probably a high number they might not have had that many at bats that might have been something that I should have trimmed off here these numbers look more reasonable 34.5 percent of the time they got on base versus getting out remember in baseball it's more likely that you're going to get out right because there's so many ways you can get out you can strike out how the fact that you hit the ball at all is like a miracle it seems like but you hit the ball you can still get thrown out at first you can have a fly out so it's not likely that you're going to get on base it's more likely that you're going to get out but in any case and that's similar to other jobs as well like sales calls you just keep at it man you keep at it don't let it get you down you just get right back up at the base and you do it again till you get on that okay enough pep talk so then we can take the mean and the average of our two data sets so this is the average for 1920 so the average batting average according to our data which again it might be a little bit different than some of the calculations that you'll you'll see if you look up because again they might have trimmed their data set a little bit differently than we did and there could be different reasons for that which we're not going to dive into in detail right now but we've got 1920 versus 2022 this is the difference between the averages then we looked at the standard deviation so here's the difference in the spread notice that 1920 has a greater spread a greater standard deviation uh of of the uh than the 2022 and then the median is if we sorted our data and then picked the one in the middle here's our median both of them are pretty close to the mean which is an indication that a bell curve might be a useful tool to represent our data as we kind of might expect would be the case because uh because we're talking about performance of people at a high level so you would think that you would have somewhat possibly have a bell curve performance in a similar way as you might expect a classroom of people of a similar level to have similar uh test scores right so then that would kind of clump in the middle and then taper off right and then the mode is the one that happens multiple times so so because we're talking about batting averages uh and and and you know they're usually clumped together pretty close the mode is is going to be possibly more relevant even though we're taking it out a couple decimals and it'll repeat itself and therefore it's close to the mean and if it's close to the mean that's another indication that a bell curve might be a useful tool so let's plot it out we're going to say okay the bell curve looks useful let's plot out a bell curve for 1920 and then we'll plot out the same thing for 2022 and then we'll look at the difference between the two and look at our actual uh charts on the left now before we do that the 19 if I was to take a histogram of the 1920 data and we made it we just took the data and made a histogram from it the buckets that excel comes up with I let it just make the buckets 4.3 to 7.3 7.3 to 10 and so on and so forth uh the middle point at the 22 to 25 around here has the highest bar so you can see it kind of looks like a bell shaped kind of shape over here we have some outliers which possibly are data points that maybe we should have trimmed out if we were looking at it in terms of how many at bats they had or something like that but it looks somewhat bell shaped so we're thinking a bell curve might be useful this is the 1920 here's the 2000 and 22 data you'll recall that uh the 1920 data over here had a standard deviation the spread of 7.38 higher than the 6.14 now that how that will look on a histogram will also depend on the bucket sizes so if we have a different range of the bucket sizes I let it so pick the bucket sizes but here's this one it still looks kind of somewhat bell curve but it might be a little bit more peaked type of bell curve because we might have a different spread different standard deviation on it so when we compare the two that could impact if we're trying to say who's the best who's the worst who's in the middle then those could have impacts who deserves a raise and whatnot and so so now we're gonna do our x's now if we're gonna pick our x's in order to plot out the bell curve then the question is how high and how low should we go now we're talking percentages so you might just say well why don't you go from zero to 100 that would that would be like test scores right it's gonna be how many times did you get on basis kind of the percent has to be zero to a hundred but we probably don't need to go all the way down to zero because we trimmed off the zeros and we and no one's going to be at 100 because that would be very good if they had enough at bats so we're going to go do our four standard deviations again so we're going to say all right four standard deviations for the 1920 data is 7.38 times four so there's four minus the middle point the mean 24.17 so that's going to give us then 535 it's negative it can't be negative well you're not going to actually get a negative number but if you took four standard deviations it's going to go down that low so it might be useful to plot it so that we get the full tail of the bell curve even though we're not actually going to see any numbers that are going to be that low in practice because you can't go negative so let's do it on the high side we've got the 7.38 times four and then plus the middle point of 24.17 that brings it up to 53.67 about because we have some rounding involved so when we do our x's I'm going to I'm going to add those low points and I'm just going to add one at a time because we're looking at percentages so now we're back to the positives one two three but I'm not going to bring it all the way up to 100 instead just bringing the x's up to the 55 that should be enough to plot our curve and be able to pick up the vast majority of the data then we can do the p of x calculations which is norm dot dist of the x negative five in this case the mean the standard deviation for 1920 is this number and this number and then we've got is it cumulative no so if we copy this down we've got very low amounts here but if I if I go around to here now the likelihood that they have a batting average their performance their average time to get on base is 21 percent of the time this represents a percent what's the likelihood of that happening 4.93 now the questions that we'll often ask when majoring job performance is going to be what's the likelihood that they get on base 21 times 21 percent of the times or above or between 21 percent of the time they get on base and 30 percent of the time those are the kind of questions that we would ask which means you could sum these up that would only give you an approximation though because we're talking about the area under the curve and you could use the norm dot dist subtraction calculation then we have the z score now the z score could be quite useful because remember that if I'm trying to compare two hitters that have different numbers of at bats like or similar to if I'm trying to compare two sales people who had different number of sales calls then how do I compare those two well one way I can do it is take the percent right that's the thing in and of itself so if I'm like okay this person had you know more at bats than the other person if I convert it to a percent I'm not just going to say therefore this this person got on base you know 100 times versus the other one got on base 75 times that doesn't tell me anything because they had different at bats but if I give the percent of times they had they got on base that could be a relevant number useful but if I compare this from one year to another 1920 to to 2020 then I might want further verification of how well they are doing in relation to the year that they are in given the fact that the circumstances are different represented in the data by the fact that you have a different mean as well as a different spread of the data that's where the z score could be useful so the z score you'll recall is calculated as we're going to take this this data point let's take a positive data point so it doesn't look silly I don't like looking silly I do it all the time but not on purpose let's go down here uh let's pick up like this one like a 30 31 so the z score is going to be then this 31 31 minus the mean minus the mean of in this case 24.17 divided by the standard deviation which is 7.38 and that gives us about the 9 the .93 so this is above the zero therefore it's on the high side so the zero is the is the middle point for the z scores so remember with batting of course in performance when we're talking about weights and heights then you kind of want to be at zero because that means you're normal you're not sick right you know you don't have you're not like way overweight or way underweight but when you're talking about hitting of course you want to be as far high of the norm as possible you're above normal would be what we're looking for up here and if it's positive that means that we're above the norm on the z scores all right so then we have the frequency so we can take our data over here and compare it to our actual data this is the this is the curve our actual data is over here so now we want to take the frequency of that data which is just saying let's bring it down here so we have some results so this is saying that some someone got above 8% but below or up to and including the 9% how many times did that happen in our data set three times so if I bring it down here we're saying how many times did someone have above 18% getting on base up to and including 19% how many times did that happen in our data set 17 times so this 17 I can't really compare to my actual bell curve very easily this result that we got from norm.dist unless I multiply times the number of results that I have so in other words if I look at the count that we did did I do the count over here let's see I didn't do I didn't do the count but if I if I add this up the actual counts of samples we have is 629 and if I multiply that times times what we got from the norm.dist which is 0.0423 we then we can make a comparison that's one way we can make the comparison the other way is I can convert this into a percent which is probably more useful especially for comparing year over year so I can look at this side and say all right I'm going to take for example this 16 and divide it by the total 16 over the total count divided by the 629 gives us if we multiply times 100 the 2.54 this one is of course the 7 over the 6 the 629 we get the 1.11 if I move the decimal over now I can compare this percents to these percents and I could look at the difference and say okay how representative is this norm.dist calculation to the actual data how representative is it of the actual data and then I can do the same thing for this is the frequency calculation I can do the same thing for 2022 I'll do this faster now I'm going to use the same x's and then we're going to do our frequency same thing but this time for 2022 and so that's our free this isn't the frequency this is the norm.dist calculation and then we have our z scores so this is this is our z score calculated in a similar fashion as we saw over on the last time but the middle point where z is zero is now between 22 and 23 and then we've got the frequency this is the number of times that our actual data populated in this case between or above 9% up to and including 10% 10 times and this represent our actual data divided by the total so that we can now compare our percentage actual data versus what we got with the norm.dist and then this is the difference between the norm.dist information and the percent of the actual data so we can kind of get an idea of how close we were so once we do that we can say okay well now I can I can compare for example this is my 1920 p of x is the blue this is the bell curve that looks like a nice bell and a bar chart and then the actual data so you can see the actual data lines up not perfectly to a bell curve but it looks like it conforms possibly enough that we can use a bell curve to help give us information about this and also note that if the bell curve doesn't conform to something like baseball uh batting averages it that might give you questions in and of itself because it's like well why wouldn't it because you would think that you have the top performers there so you would think that that uh the performance would follow some kind of bell curve distribution you would think unless something weird was happening here's here's the 2022 information it would be like test scores you know if you have people of pretty similar levels in a classroom you would kind of expect the outcome to be in a bell shape shape distribution if it's not then the question is well what is happening here what why wouldn't it be right that could raise questions in and of itself so here's the here's the blue and and so you can see this one for 2022 we had this kind of more peaked information here if I plot the two the two uh p of x for 2020 these are the two ones generated from the bell curve the even bell curve not the actual data you can see that the 2020 is is middle point is a little to the left and it's a little bit more peaked representing uh less spread of the data whereas the blue curve is has an average a little bit to the right and you have more variance uh of the data more spread of the data that we can that we can see there now when I compare then 1920 to 2022 the question is well how am I going to compare say an individual hitter that that is in these two years because uh because the fact that that their performances is going to be should be compared somewhat to the times and and how difficult it was whatever conditions were taking place at the times so first let's do the difference between the two we're going to say all right this is going to be uh the difference or just the x is again this is uh the p of x differences so now I can look at the difference between p of x for 1920 versus 2022 we can look at the differences between the z scores for 1920 and 2022 and we can look at the differences between the actual percent of the total for 2022 and 1920 and then of course you can go in and start asking questions and you can look at the z score is often going to be another factor here when you're comparing two periods because the general idea would be if I'm comparing job performance batting averages batting being their job then I again I could say well they had different at bats therefore I'm going to look at their percent of success not the total times they had on base to give me a comparative number so so a 23 percent would be would be better you know I can I can compare that to someone who had less at bats and I can use the percent to compare but if I'm comparing two different years then I might say okay well if I'm looking at a a let's say let's say I'm looking here at a 24 batting average 24 percent in 1920 that was actually below the the middle point they were doing less than the average performance on the z score which you could see here with the z score and you can also just see because the average would be you know the average here was what what we got was a 24 17 right so they're so they're actually below that just barely but if I compared that to a hitter in 2022 that 24 percent is is actually above the average right so now so now I'm going to kind of compare these two and I could when I start to compare different people from different time frames I can say okay well yeah they both did 24 percent on base but the conditions were different in the two time frames so how do I how do I compare that well that's what again the z score could be useful because you can say well this is what the average person was doing over here that was actually less than average whereas over here they're actually outperforming their peers in that time frame which could lead to all sorts of other questions in terms of well what happened why exactly might that be the case and so on and so forth and how does that reflect on the players skills pros or cons and so on so that can open up a whole another nest of questions but that's one of the general techniques that can come into play and remember it comes into play not only possibly with baseball and sports but any kind of job performance kind of situation you might have a similar a similar kind of process and the better you understand that whether you're trying to look for a raise or whether you're negotiating a raise with an employee or something like that or what's a fair compensation of one person versus another person or something like that then these are tools that you want to kind of be that you certainly want to be aware of if someone's doing performance reviews and stuff over the work that you're doing and that kind of stuff then you want to be aware of the statistical tools that are being used so that you can see if they're being applied well the person is doing this the statistical analysis has an idea of actually what they're doing and not just going through the motions and that you can you can analyze the results in a fair and objective way