 statistics and excel standard deviation and variance large outlier impact got data let's get stuck into it with statistics and excel you're first a word from our sponsor yeah actually we're sponsoring ourselves on this one because apparently the merchandisers they don't want to be seen with us but but that's okay whatever because our merchandise is is better than their stupid stuff anyways like our trust me i'm an accountant product line yeah it's paramount that you let people know that you're an accountant because apparently we're among the only ones equipped with the number crunching skills to answer society's current deep complex and nuanced questions if you would like a commercial free experience consider subscribing to our website at accounting instruction dot com or accounting instruction dot think of it dot com not required to but if you have access to one note we're in the icon left hand side one note presentation 1446 standard deviation and variance large outlier impact we're also uploading our transcripts so that you can go to the view tab the immersive reader tool change the language if you so choose be able to read or listen to the transcripts in multiple different languages using the time stamps to tie in to the video presentation one note desktop version here data on the left hand side it being salary or income data that we can imagine is for say a corporation in prior presentations we've been thinking about how we can take data sets summarize those data sets in ways that we can draw meaning from that data using both numerical summaries as well as pictorial summaries numerical summaries including our standard statistical calculations such as the mean or average the men quartile one median quartile two and so on and so forth as well as pictorial representations like the box and whiskers and the histogram now the histogram in particular is kind of what we're focused in on here because we're looking to see more of the spread of the data rather than simply the middle point of the data the histogram giving a nice pictorial representation now we're looking for more of a mathematical calculation of that same kind of concept the standard one being that we're using the standard deviation and the variance which we've talked about in prior presentations what we're going to do now is take our salary data and then look at the impact of the outlier a similar technique we've seen in the past except now we have added the our added couple calculations the standard deviation and the variance and we'll look at the impact on those numbers from an outlier so if we do our normal statistical calculations that we've seen in the past on our salary data and then we're going to add another you know the million dollar down below but we have a couple a couple new calculations here standard deviation and variance so the mean we can do with our average function adding everything up dividing by the number of cells the minimum would simply be the smallest number in our data set we've got quartile one the one in the middle of the first quartile and we can use this formula in excel to pick that one up the median which could also be quartile two we can use either quartile formula or the median the median formula being more efficient that's picking the set the one in the middle if we were to order the data from smallest to highest for example quartile three the middle of the third quartile we can use a formula in excel to pick that one up and then we have the maximum the largest number in the data set so just from this data we can't get a great idea of of the spread of the data but we get a general sense of it because we have these quartiles that give us a general sense and we can also see if there's a big difference between something like the mean and the median for example that might be an indication that there's an outlier so this gives us some sense of the spread but it doesn't give us as nice of a sense as say the histogram the pictorial sense that gives us that visual sense and we have some other calculations that can give us the idea of the spreads as which are the standard deviation we're looking at the standard deviation for the population so we can add these to our normal kind of stats that we can do in excel quite easily with the equals std ev dot p we're picking up the data you know in the table we're looking at the entire population as a vote as opposed to if it was a sample which would be dot s in our formula and then the variance so this is the variance same thing where we have the population versus the standard versus the sample so we're using population data at this point in time now these numbers up top probably give you a pretty intuitive sense of the data in and of itself you can say okay if i'm imagining this data the mean i would say that's kind of like a focal point of the data this one is the one in the middle if we were going to line everything up this one is the you know in the middle of the first quartile middle of the third quartile the smallest number the biggest number that gives me some sense of kind of the spread of the data the standard deviation might not be as easy to to kind of picture but you can give you know an idea of the average distance from kind of the middle point is the concept of it so it's a little bit more difficult to to wrap your mind around but it's it gives you a different kind of look or concept of the of the idea of how the numbers are or how close they are to you know that middle point the variance obviously looks way out outside of what you would think with the rest of these numbers but when you compare different data sets sometimes that variance can be a useful tool as well so oftentimes when we're comparing the variance of two different data sets similar salaries for different corporations then that variance could be useful and we'll possibly dive into that more in future presentations but if i take a look at a histogram notice the histogram gives us more of a sense of the spread than then generally the data set over here just the numerical the numerical data of of these calculations right we get we get a little bit more of a nice intuitive pictorial sense of it let's let's take a look at the variance and the standard deviation if we were to calculate it here just quickly we've seen this in the past but we've got these two formulas the variance is basically a stepping stone kind of in the process of the standard deviation because you can see the entire variance is under here for the standard deviation and then we take the square root so if we did our calculations using our formulas instead of the function n excel we would take all of our data points minus the middle point that's why we're getting a sense of how how these points are structured around the middle point the mean which is calculated at that 71 498 the average right and so then i get our differences which could be higher or lower so remember that will always if i sum up my differences get to zero if we're taking positive and negative numbers because some are going to be higher and lower and then instead of taking the absolute value we square them so now we're squaring them as we can see here which is going to get results in this large number and if we take that large number divided by the count the number of items that's going to give us our variance which is still a large number and then if we take the square root of that that's going to give us our standard deviation so there's our 2007 51 and you can see that it gives us kind of a sense of how spread out the data is around uh around the the the kind of the fulcrum pouring point the mean so now let's do it again but this time i added a million dollars down below that's we're going once again imagine that being the ceo salary so now we've got this large outlier salary if i look at my calculations if i look at my average calculation versus this average then we can see the average is being impacted by the outlier so once again we're going to think about which numbers are impacted by the outlier and which are not and how will that affect our decision making the minimum number is still the same because the outlier is not on the on the low end if it were then the minimum number would give us an indication if i had a minimum number that was like one dollar because the ceo let's imagine that he said i'm going to i'm going to just take one dollar salary unless the company makes a profit makes more money then then maybe he only gets one dollar right and now you've got it like an outlier on the low end that possibly could give you an indication of the average being dragged down and then quartile one is kind of like the middle of that first quartile it's a little bit different here but it's similar because it's resistant it's more resilient to a big outlier so then if there is a big outlier the quartile calculations quartile one quartile two or the median and quartile three are going to be stronger indications possibly of what we might be looking for for looking to get a job in this place then the mean might not be the one to look at if we think of ourselves as kind of an average employee the middle numbers is possibly the the better one to go to to use in that instance possibly now if you're arguing for a raise or something like that then you might use the mean the average because it's higher and you're and you're going to try to make your argument although you want to be honest you know in the argument uh and not not come off as deceptive you know but in any case the maximum you can see the maximum is a lot uh higher because that's where our outlier is so we have this big difference between these two notice that we also have this big difference between the mean or not a big one but pretty big between the mean and the median which is the indication that there could be an outlier so this difference between the average and the median as well as this big difference between the average and the maximum um are going to are going to tip us off that there's going to be an outlier that's going to have an impact on the data set and then we've got uh the the standard deviation which is substantially impacted right so the standard so and this is going to be the main the main new point here the standard deviation and the variance also significantly impacted by an outlier right so so when we're trying to analyze those uh those numbers as well we've got to take into consideration what about the outliers what what is our objective here and what kind of impacts are those outliers having on our numbers and how do we want to be dealing with that so you have the same kind of issue like with the histogram that we saw in a prior presentation where where you have this these are kind of the the numbers grouped up and if I have this large outlier and I keep the boxes enough boxes so we can see the outlier then obviously the histogram looks you know not very useful but you can get the point that the outliers really skewing the data the fulcrum of the the uh the mean is way over here somewhere and this outlier is putting a lot of leverage on it you know on the outside now again you can make your your histograms so that that outlier everything over a certain point is an outlier so you might say everything over 90 000 is in the outlier so you have a histogram that looks relevant looks good here uh because that'll show you kind of the middle point of the data and the outliers over a certain point but the point here is we have to keep in mind what that outlier is going to do when we look at certain certain you know data sets and then ask what do we want to do with it do we want to trim off the outlier and then look at what these numbers would be these calculations including the standard deviation and the variance if we remove the outlier do I want to rely on the quartile calculations instead so that the outlier doesn't have that kind of impact and so on those are the kind of decisions to be thinking about now if we did our same calculation over here the long way uh with our standard deviation it would be the salary minus the mean this is the difference and then we square them notice this gives you a good indication of what's going on here because now you're like wait a second the mean is now 89 356 so all of my data every data point is is uh below the mean right it's all below the mean until I get to that one million and you see how that kind of gives you a better sense of what what's so sometimes actually doing the calculations like this gives you a more a better sense of what's going on sometimes as well it's pretty easy to do you can say well wait a second this looks kind of funny everything and then and then the outliers the only one that's above the mean because that's showing you that kind of fulcrum effect where that big outliers having a lot of pole and then and then we're going to say so so if we square the difference so this is this number and then we're going to say the count is 52 so we divide by the count to get the variance and then we take the square root and that gives us that 127 545 the same 127 five four 127 545 that we had here