 Statistics and Excel. Wages, data, box plot or box and whiskers versus histogram. Got data? Let's get stuck into it with statistics and excel. You're not first a word from our sponsor. Yeah, actually we're sponsoring ourselves on this one because apparently the merchandisers they don't want to be seen with us but but that's okay whatever because our merchandises is better than their stupid stuff anyways. Like our Accounting Rocks product line. If you're not crunching cords using excel you're doing it wrong. A must-have product because the fact as everyone knows of accounting being one of the highest forms of artistic expression means accountants have a requirement, the obligation, a duty to share the tools necessary to properly channel the creative muse and the muse she rarely speaks more clearly than through the beautiful symmetry of spreadsheets. So get the shirt because the creative muse she could use a new pair of shoes. If you would like a commercial free experience consider subscribing to our website at accounting instruction.com or accountinginstruction.thinkific.com not required to but if you have access to one note we're in the icon on the left hand side one note and excel presentation tab the 1024 wages data box plot or box and whiskers versus histogram tab also we're trying to upload our transcripts so you can also go to the view up top use the immersive reader tool if you so choose you can change the language of the transcript and then either read or listen to the transcript in multiple different languages desktop version of one note here we have our information our data on the left hand side it's the same data that we saw in a prior presentation related to wages data or salary data that we are imagining for a corporation now once we have the data set how we sort the data set might be in some way dependent upon what we plan to be doing with the data set so for example you might have someone arguing about whether they deserve a raise or not and they might be using the relevant salaries of other individuals to make their argument for example usually however when we have a data set we're going to want to put it in some kind of order so that's the first thing we did we took the data which might be in a random order usually it could be in random order because you took a random sample and it might be ordered by the sample or because it's say in alphabetical order or something like that and what we would typically like to do is order it from lowest to highest or highest to lowest we can do that quite easily in excel excel being of course an amazing tool now once we have that oftentimes we want to then visualize the data that helps us to give a visual representation so we talked last time about the box and whiskers which gives us a nice visual representation and it also gives us those key kind of data points calculations on the box and whiskers the second type of visual representation is going to be the histogram so there's the reveal here's the histogram so the histogram gives us a nice spread a nice look of the spread of the data and this is quite important oftentimes because what we want to do is we want to get an idea of what the middle point of the data is but then generally we want to get a feel for how dispersed the data is now we can start to look at more technical calculations for that and we'll do that in future presentations but a general pictorial feel for the spread of the data will give us a much better kind of idea of what's going on with the data so we'll take a look at a lot of different histograms and we'll try to get some general terms for describing the general shape of the histogram in future presentations but if we just compare these two things you'll recall that the box and whiskers gives us a nice compact view with a lot of the the numerical data kind of packed into it so here we added the numerical data on it so you get to see everything the x representing the average or the mean you've got the line in the middle that's the middle point of the data you'll recall from the prior presentation and then you've got the quartile one and quartiles two on and quartile three in the box that that is in place and then of course you've got your outliers that are listed out outside of the box and outside of the whiskers and we use that kind of heuristic calculation calculating the inner quartile range to figure those items so you get a you get kind of some more of those technical numbers that you would like to see from a numerical standpoint in addition to the pictorial representation when you when you look at a histogram then the general idea of the histogram is we're going to be taking this data and we're going to try to list it in something that's similar to just a bar chart that you might be used to and that's this the bar chart is what we saw when we listed out how many words were in Shakespeare or Hamlet's Shakespeare and how many words were in Harry Potter so unlike the the Harry Potter example however we can't just list the number of words as the buckets on the x-axis we have to use a numerical representation on the x-axis now the problem with a numerical representation is that for example this data is going from 55,000 up to 84,000 so I can't clearly put every single number between those those thresholds in here I can't even choose the numbers that actually appear so so because because we might have out of the 51 numbers that we have here there might be a couple that are the same and and well and if I was to put a a list all of the numbers that are represented I might have a couple that have like five of them that are the exact same number however it's quite possible that we have 51 thousand 51 numbers that are completely all of them different because we're talking about salaries and so so we might not have any of them that are the same however we can still kind of group them together by saying well they're within a certain range so these two might not be exactly the same but they're within a certain range and that's going to be true of many numerical data so we might say okay I you know the all these numbers are different but they're within a range so that of course is our objective we would like to say what are going to be the ranges so in this particular case we've got the first range of 55 thousand to 58 thousand 400 so that's the going to be picking up this uh 55 which is the outlier here and then the second box that we have is the 58 thousand 400 to 61 800 nothing's in it the 61 800 to 52 65 200 nothing's in it and then once we get to the 65 200 to the 68 600 we find five of them in here so that's going to be the 65 to uh 65 to you know so somewhere above above and above here to 68 6 right to 68 so not including the 55 to 68 6 uh which is like here so 1 2 3 4 5 are included in that bucket right and then most of them lie in 68 6 to 72 so 68 68 6 is to 72 so if I go down to my data everything down to 72 you know around here so that of clearly is the biggest bucket 26 and then everything from 72 to 75 4 so everything from 72 like down here to 75 4 which includes basically everything to the outliers and then then you've got the 75 4 to 78 8 and then these two are representing the outlier buckets so so there we have our our pictorial representation given us a fairly good feel for the spread so so you don't have the exact calculations again that you do in the box and whiskers to give you you know that middle point is the mean versus the median and so on but you can get a feel pictorially for the middle of uh of the the histogram right so we can see it we can kind of get a feel for what the middle of the histogram we don't have a formal representation as we do with the box and whiskers which is still kind of a heuristic where we said everything that's above this threshold uh is going to be these outliers we don't have a a formal kind of uh determination of it being an outlier but we can see if we have depending on how we've grouped the boxes that these look like they're outside of the the general range so we could so we see that kind of intuitively uh in that way so notice that if you showed something like this and you were arguing that you should get a raise or something and you show and you show a pictorial representation of this and you're like yeah my and my wage is like down here and i'm doing the same you know then you know then you can see how this this picture can give you more more umph you know for the for the on a particular argument in some ways now one of the big things that we have to be able to understand is these boxes down below uh how how how are we going to construct these items and and excel then we have our a nice ability to do this now if you had to do this by hand it's kind of a pain to to make these boxes right if we had to do this uh and with just this set of numbers you know we'd have to count all the numbers and whatnot but with excel great tool we can just highlight the thing and make a nice histogram with it and by default excel will then give us a bucket size which excel for however they calculate it is going to say that that's the reasonable calculation and it's usually pretty good so if i but if i leave it on automatic then they give us the bucket size that's where the bucket size is coming from they made nine buckets one two three four five six seven eight nine buckets and the distance between them is uh 3400 meaning if i take the upper threshold the 58 400 minus the 55 the 55 thousand three thousand four hundred being the distance between them now you also have this kind of thing where the end of this bucket is the starting point of that bucket so what if something landed exactly on uh 58 400 i believe it would be in the second bucket and not included uh i'm sorry it would be in this bucket so it so if it was 58 400 it's going to be included here and this bucket represents everything above uh 58 400 up to and including 61 800 we'll talk about that a little bit more uh in a future presentation uh but just to get a general idea that of you know you're going to say well what if it falls on the line there right so if you change the bend size in excel and i changed this down to like 2000 for example and we'll play with this more in future presentations a lot then that will automatically change generally the number of of buckets or bends so that's going to increase the buckets or bends so if i play with these two in excel which if you want to look at the excel one we will do then uh then then you're generally and the other one's going to change too now note you can also make many changes to the histogram which can kind of alter the look in the feel of the histogram which we'll talk about more later because if i added a ton of like a whole bunch of buckets then i might have so many buckets down here that i lose the meaning of the histogram because now it's too detailed but if i have too few buckets like if i only had one or two buckets then it's not going to give me any feel of the spread anymore either uh also note that if we if we go down here we have the overflow bin and the underflow bin meaning if i have these outliers that are laying that are way on the outside then that's going to distort my my overall chart because because now i have to have this very long chart in order to pick in these outsiders so i could say hey look i'm got everything that's over a certain amount i just want you to put in this bucket and in that way it can be useful because now you don't have this chart with a bunch of empty space in it because the outliers are way outside however it could also kind of distort things a little bit because now you've you've kind of reduced the ability to see you know how far out the outliers are which you know so there's pros and cons to that also note that if you were to like extend the the data range over here to make a higher data range let's say you made it go up to like a hundred the chart would look more squat it would look smaller and if you were trying to emphasize the data more and you put less of the data you know you make it as small as possible to make the the bars as high as possible these are kind of things that that you might play with for a couple different reasons one because you're trying to get an accurate picture of the data so you're you're trying to say look if i remove the outliers that'll give me a better sense of what i'm trying to figure out with this which is this middle the heart of the data and also people use those are the kind of things people use to manipulate right if they're trying to if they're trying to emphasize a particular point then they might try to over emphasize these these middle points and and what not to make them you know appear taller than in the pictorial representation so what we want to do is we want to be able to adjust the histograms so that we can get the most meaning from it and we want to be able to see how histograms are adjusted so that when people kind of use deceptive manipulation of the histograms we can say ah you know i'm not sure you're given an honest representation of the data because of this or that or if you looked at it from a different angle we might come up with a different result so that's a general idea so of so so you'll also note that this histogram kind of approximate you can kind of think of a curve that would kind of approximate the histogram and we'll start to think about that later that the standard curve that people comes to people's mind is like a bell shaped curve the histogram isn't necessarily always going to come up with like a bell shaped curve right it depends on the actual underlying data that is involved in it so we'll talk about different formats uh uh being able to describe different data layouts that that could arise in a histogram so we can kind of explain them if we can come up with a with with a mathematical curve the most famous one being a bell curve but if we can come up with some kind of curve that we can define with a fairly simple mathematical equation that would be great because that allows us then to make predictions mathematically a lot more easily whereas if you have a set of data the data could not be applicable to a curve you might have a data set that's just uh you know there's no way to come up with a mathematical representation very easily without you know complex calculus or something uh so so that makes so anytime we see when we see a trend that we can approximate with a pretty simple function that we can represent with a curve or a line or something like that then obviously that that would be a more simplified thing to do uh from a mathematical perspective but for now we're just going to look at these different kind of histograms and see if we can then explain the data uh on the histogram like it's skewed to the left and skewed to the right or something those types of things to try to describe this pictorial spread of the data so the general idea here is that we want the actual the actual data over here is great the average the median the mean quartile one quartile two the pictorial rest of representation of the box and whiskers gives us both a pictorial representation and some of those more technical uh kind of numbers and a general sense of the spread uh but then when we're really looking at the spread that's when we really would generally probably go to like a histogram type of format something like this to really get a sense of what the spread is and then of course the next step would then be is there's some way to approximate this shape with an actual line that maybe we can even make a a function from a mathematical equation and if we could that might help us to to to to make predictions based on the equation which would be nice