 statistics and Excel mean and outliers get ready taking a deep breath holding it in for 10 seconds and looking forward to a smooth soothing Excel first a word from our sponsor yeah actually we're sponsoring ourselves on this one because apparently the merchandisers they don't want to be seen with us but but that's okay whatever because our merchandise is is better than their stupid stuff anyways like our crunchy numbers is my cardio product line now I'm not saying that subscribing to this channel crunchy numbers with us will make you thin fit and healthy or anything however it does seem like it worked for her just saying so you know subscribe hit the bell thing and buy some merchandise so you can make the world a better place by sharing your accounting instruction exercise routine if you would like a commercial free experience consider subscribing to our website at accounting instruction.com or accounting instruction.thinkific.com we are in our Excel worksheet if you don't have access to this workbook that's okay because we'll basically build this from a blank worksheet but if you do have access three tabs down below example practice blank example in essence answer key practice tab having pre-formatted cells so you can get right to the heart of the practice problem blank tab just having our data set in it so we can practice formatting the cells within Excel as we work through the practice problem let's take a look at the example tab to get an idea of where we will be going we're gonna have our salary data on the left-hand side we'll do some calculations for it including the mean create our histogram and then we'll think about what if we add one more line item to that salary data that being an outlier possibly the CEO's salary and what would be the impact on the mean and something like a histogram with that outlier in place so let's go on to the blank tab now if you want to have practice data sets to work with we suggest taking a look at Kaggle dot com K-A-G-G-L-E dot com you can also create the data set or simply type in a data set if you so choose it's not too long of a data set we're working with here let's do our normal formatting as we do every time I'm gonna select the entire worksheet or pretty much every time and then I'm gonna right click on it and we'll format the cells this is my baseline underline formatting we're gonna then go to the numbers tab on the left currency and then I want to make negative numbers be bracketed and read I don't need any dollar signs and right now no decimals are needed so I'm gonna remove the decimals and add them as needed and then okay I'm gonna make the whole thing bold by going to the home tab font group and the whole thing is bold that's how I work I'm all bold all the time that's how it is man anyways not really but still we're gonna then we're gonna then insert let's put our our tab here and we'll go to the insert and make a table out of it tables and let's put a table in around our data so I'll go ahead and insert our table let's pull that over here so we can see it dancing ants look to be in the proper locations I've added up top the formulas we created in a prior presentation so we will not recreate them but if you want to be able to type in the formula remember you can go to the insert tab symbols group equation and then within the equation I recommend using the ink equation helping you to basically create what you're writing into the format in Excel which is quite nice okay now the data on the left represents salary data we're imagining for a corporation for example so let's do some of our normal statistical calculations before we do I want to make a skinny bee column because I don't want to put anything right next to the table so I'm gonna put my cursor between column B and C left-click and make a skinny bee skinny bee alright and then let's type in here that we're gonna have the average or mean let's say mean or average mean or a or average average and this is gonna be created from Excel and I'm gonna make the column C a little bit larger I don't think I need the colon here and then I'm gonna do the formula equals the average and then double click on the average as we've seen in the past I'm gonna select the data because it's in a table I can just use the down arrow so the dancing ants are now around the entire data set they're doing their Mamba around and working their magic their their voodoo magic that somehow creates a formula when they dance they can make rain too I hear probably not but still median median calculation equals the median and double click on that that's the one where we if we ordered the data from top to bottom we would pick the one in the middle so I'm gonna select the whole data and then say there's the median so if I was to order it in order then that would be the one where we pick the one in the middle and then we have our max let's do the max calculation equals the MAX this is gonna be the largest number so I'm gonna select the max select our data and enter and then the men calculation equals the men it's gonna be the smallest number these are quite useful formulas select the entire things or functions you also might have quartile one right Q1 that we can put in place equals quartile one which is kind of like the median but for the first quartile right I would select the entire range and I have to add then comma one I'm doing this fairly quickly because we've seen them in the past and then Q2 or not Q2 Q3 because the the median is Q2 Q3 equals quartile double-click the quartile the entire set of data comma and this is gonna be 3 for quartile 3 so there's our data set now let's recalculate the average and this is what I mean by basically converting something that's in a in a formula standpoint instead of doing the algebra on it and doing it by hand and instead of going to the other extreme of just using a function or a formula in excel we can kind of make a little table for it right we can create a little formula which is useful for us to see the steps that are being taken and in excel so we can say let's do the average or mean calculation and this will be the manual calculation now this is you can think of this kind of like making up a tax return or something right you're doing you're showing your work in like like a tax return format right we could see the formulas line by line so in any way so we could say we're gonna we're gonna if I if I look at this this format what I need to do is sum up all of the x's right so this formula means x bar the mean or the average of x1 x2 x3 these all represent x right for the number of x's that are here so x1 x2 x3 and so on and then we're gonna sum all that up and divide it by in same thing here the sum represents the same thing as the numerator up top we're summing all the x's where x starts at I and then it goes to n I equals 1 to n divided by the number of n so we can we can tap that down here and I could say well let's calculate the numerator which is basically the sum values of we could say x's the x's equals let's use our trustee some function this time summing up and everything in here so we'll sum it up all these items which represent x1 x2 x3 on to xn for our however many x's are there and then we're gonna say that we want to the divide divide by number of values which is we're representing as n in our in our formula right so we're gonna divide by the number of values to get that I'll use the count function equals count brackets and we're simply going to select our data set which will count all of the items that's the number of items there's 51 of them and that will give us our average or mean and then I'm gonna say this equals up to to the sum of the data divided by up one for the 51 there it is so we get we get the same calculation here and here let's put an underline home tab font group underline now if you're making a table oftentimes I might make the top like a header right home tab font group brackets let's make this black and then the font white you might indent some of the internal sides right so the useful little indent thing home tab alignment indent so because it has a colon this is a sub calculation within it and then down here the total I might double indent home tab alignment double indent so I might make this a little bit wider so there we have it let's put some blue around this because that's my normal data input color home tab font group drop down bucket if you don't have that blue it's under the more colors standard and then blue so we'll say okay I'm gonna put borders font group drop down borders so there we have it I'll do the same thing here with our data here home tab font group blue and border okay so there's our normal stats and we thought about our formula in a table format a little bit more rigorously now let's let's go and make our histogram so I'm gonna select the entire data set and I'm gonna go into the insert tab and go into the charts and histogram and let's make a histogram of it I'll delete the title because we could see the data set and it looks something like that so there's our histogram so that looks good we can of course change the number of buckets on the left if we want let's let's play with that a little if I hit the buckets on the left let's make it it's on automatic I'm on this symbol I can say the the let's say the number of bends let's say we want like 11 bends so then I have something that looks like this and so we've got more data on the left and then we've got a couple data sets on on the right so there's kind of our our mapping of the data it's it's not really populating as like a bell shaped curve right it's it's tailing off to the right you know kind of shape of the data all right well let's just take this what we want to do now is just think about well what if there was an outlier to the data right so now we're gonna add an outlier so let's take this entire thing and I'm gonna I'm going to copy it to the right so I'm gonna take this entire starting point and because everything is relative to each other I could sorry about that I think I swallowed a fly the fly was so sick of hearing me talk it took a it took a kamikaze death charge down my throat and tried to stop me from talk any case we're gonna let's copy the whole thing so I'm gonna go from column a I'll select the whole column and then I'm gonna go on over and we can go all the way to let's say column L control C our copy right-click and copy if you so choose and then I'm gonna put it in column in so in column in right-click and I'm just gonna paste it normal so now it pasted the whole thing just normal right so now I I'm gonna make a skinny column M now I want to make sure that everything is populating the way it should so in other words like this this is now pulling from my new data set it's not pulling from the old table over here that's what we want to have happen so that one looks good they all look like they're pulling from my new data set which is the same numbers but a different table which I'm now going to modify so that looks good now that let's check the chart the chart is pulling from the old data set so it's pulling from this table so I want to change my histogram to pull from the new data set this data set so I can select the chart one way I could do that is gonna go to the chart tools up top and then the data that's being selected I'm gonna go into that and so it there's the dancing ants doing their mamba around the data but it moved it jumped over here I'm gonna move it back and I want to take that series of data and edit it and then say you need to change where you do your mamba ants see it starts to go over there again that's not where you want to go I want you to dance around these ones dance around those ones and so I'll say okay and then okay and then okay so now we've got our dancing ants we've got the correct data set okay so now we're just gonna add one more piece of data these are all the data sets but now you've got the CEO the that's that's he makes one million dollars let's just bring it up to one million and you can imagine what's gonna happen that's a hundred thousand no no one million one million okay so now we got to see if our data sets picking up the right data so if I select all my data is it going all the way down it is because I made a table out of it so that one it looks like it's picking up the entire table the maximum looks correct notice our average has now jumped up to 89 354 versus the 71 out 498 so so we have a pretty significant impact on the average from that that that new data set so if I sum everything up here picks up the million here as well and then so here's our new average calculation let's take a look at the table I don't think the table is picking up the new data set if I scroll down it's not picking up that million so now let's change the table and say let's change this one easiest way to do it is I just click on it here and then it shows me the the data and I could just drag that box down to pick up the million and so there it is so now I've got a really kind of messed up histogram because the million is this big outlier and it's trying to pick up a number of boxes that will take in to consideration that that one dollar amount that's way way out so it looks like everything else is in this one box over here now if I try to make my boxes let's let's make this as wide as possible and then I'll try to make my boxes so that it it can pick up more of this stuff let's make it like this and that's one and I'll scroll it in a little bit alright now let's let's let's make it I'm gonna go to my my numbers and then here access and then I'm gonna make the number of bins go way up to like 450 bins right so now you can see now you can see what it's doing here it's got this one line item way over here and everything else is kind of packed on the left hand side let's do like 400 400 bins 400 so then right you got that one line item over there what if I did like 350 350 bins let's do we'll keep it let's let's do the 400 again anyways so so that's so so that's our so that's obviously a problem in that the outliers can have a huge impact on some of our statistical tools so when I'm when I'm using this histogram if I was to remove the outlier or another thing I can do is try to say that anything over a certain amount then I'm gonna say the overflow is anything over like 90,000 was it so now if I bring my bin size back down to automatic then then it's only got two bins that's automatic let's bring it back up to like 10 bins so now you've got something that looks more reasonable but this but this outlier is pulling pulling everything to that outlier right so just to demonstrate that I'm gonna remove that and and we're gonna say that we now have the bins like 450 bins just so we could see the impact of the outlier right so there it is so now now again so some of the issues that we have here are that the the the average is gonna be impacted by that outlier that outlier could even be larger right like what if this was like you know nine million you know then we have the the average jumping up to 243,000 and versus the 71 now notice if I take the median then the median is still is still looking relatively good because it's the one in the middle so that one outlier isn't impacting the median so this is kind of the the let's bring this back to a million so our histogram is back so so this is the kind of what we have to think about which number is most representative for what we are doing and again this is some another place where people will often play when they're trying to be kind of deceptive with numbers right so you got it you got to say well are they are they using the mean or are they using the average in which would be appropriate to use and in this case so if you're looking at if you're looking at like the average height of people and you have a bell shaped distribution of the data then the mean is often quite useful is a good kind of summary number to use if for example however that you have salary data and you have something like this where the CEO is way outside the normal and someone's trying to argue for an increase in their salary or the unions or something are trying to in trying to say that we need an increase in the salary because we're below the average salary well that might be a little bit distorted because because the average salary is taken into consideration these huge outliers that are out that are out there right so you might you know you might say well maybe the outlier people should make less instead of everyone else make it more you know and let unless unless you have the money to you know you it's hard to or or if you're taking home prices for example and when you take the home prices you have you might have a big mansion that's in the neighborhood that might be worth millions of dollars whereas the other homes are somewhere around the the average of the area which like say is a hundred thousand dollars or something well then you you have to you would think that the the median the middle number the average or the middle number in the data set would be more appropriate than the mean because the means going to be skewed by that big outlier or you could say well why don't I take this set of numbers and remove the big outlier take out the CEO salary and then that that middle number the mean number the average will make more sense for normal for the normal people like if I'm gonna if I'm trying to get a new job somewhere out of out of college or something like that then for me to take the the average of the of the company salary which takes into consideration the the high CEO salaries then that's not going to probably be representative of what I might be making in that organization whereas the median might be a lot more representative of what I might make in the organization or if I was to take out the outliers and look at the average salary without the outliers so again these are when we look at these differences this is a another area where people start to say well statistics is just another way to lie with numbers right but it's not but but when people are deceiving with the numbers they're there what they're doing is just like what they do with words they're choosing intentionally oftentimes the wrong representation that has some truth to it but which is misleading either through omission of the proper data to get the proper perspective or you know or just you know miss misinterpreting the data that has been given so so there's that