 Statistics and Excel. Histograms with different bucket sizes. Got data? Let's get stuck into it with statistics and Excel. First, a word from our sponsor. Yeah, actually we're sponsoring ourselves on this one because apparently the merchandisers, they don't want to be seen with us. But that's okay whatever because our merchandise is better than their stupid stuff anyways. Like our Accounting Rocks product line. If you're not crunching cords using Excel, you're doing it wrong. I must have product because the fact as everyone knows of accounting being one of the highest forms of artistic expression means accountants have a requirement, the obligation, a duty to share the tools necessary to properly channel the creative muse. And the muse, she rarely speaks more clearly than through the beautiful symmetry of spreadsheets. So get the shirt because the creative muse, she could use a new pair of shoes. If you would like a commercial free experience, consider subscribing to our website at accountinginstruction.com or accountinginstruction.thinkific.com. You're not required to, but if you have access to OneNote, we're in the icon on the left hand side. OneNote and Excel presentations tab of 1040 Histogram with different bucket sizes. We're also attempting to upload our transcripts to OneNote. You can use the immersive reader tool to change the language of the transcripts and then either read through or actually listen to the transcripts in multiple languages using the time stamps to tie in to the video presentations. Desktop version of OneNote here, the information, the data on the left hand side. We are imagining to be annual salary or annual income of employees at say a corporation. Same data sets we've been taking a look at in prior presentation. Instead of having that data ordered by like alphabetical order of employees. First thing we typically do is sort that data either from lowest to highest or highest to lowest. That gives us a general sense of the data. And then we've been talking about pictorial representations, the major two being the box and whiskers and the histogram. We're now focused on the histogram, which gives us that nice spread of the data. And we looked at a histogram in a prior presentation. And if you actually create a histogram in Excel, Excel being a great tool, of course, to be able to put together these pictorial images. Excel will usually give you a pretty good approximation, but Excel is going to have to make some approximations to create the histogram. And one of the main things it has to do is think about how many boxes do we want down below and what's going to be the spread between the boxes. So in other words, we have our data on the left hand side. What we want to do is put that data into buckets so that we can then see how many items fall into each of those buckets. So then, of course, the question is, well, how spread out should the buckets be? What kind of buckets should we be making? And Excel, if we just highlight the data and insert a histogram, will make some of those assumptions. So here, for example, 55,000 to 58,400. So we only have one of the data points in here. Nothing is in the buckets of 58 all the way up to 65,200. And then when we go to 65,200 to 68,600, 65,200 to 68,600 somewhere around here, we've got five data sets and so on and so forth. Now, if we make some changes to our histogram, then it can make a big change in what the histogram looks like. And if there's a big change to what the histogram looks like, that could make changes in terms of our perception of what the histogram is telling us. So from a positive perspective, that means that we can alter some of the items on the histogram in order to get a better picture of what it is that we're trying to narrow in on. On the negative side of that, you can also imagine, and this is often what happens in practical, in practice, people using histograms to kind of support an opinion that they already have in place. So if someone has an opinion that they want to put some kind of policy in place in the corporation or something like that, then they might manipulate the look of a histogram or whatever pictorial representation they are using to better represent their argument. So what we want to do then is to be able to say, okay, how can we use the histogram when we're really trying to actually understand the data so we can zero in on a picture that is best representative from multiple angles, and how can we see how someone might try to kind of be deceptive with the manipulation of the histograms if they're trying to argue for a particular point. And of course, in order to do that, we have to put our mind in the mindset of someone who's kind of trying to be deceptive with the histograms so that we can guard against someone trying to be deceptive with any pictorial representations of data. Alright, so by default on the left hand side, this is kind of the information related to this x-axis, the buckets. The two major categories are going to be the bin width and the number of bins. Now, currently, we have it as automatic. So it's on automatic, which means these two are grayed out and Excel just put in 3400 for the bin width. So in other words, the difference between these two numbers, if I pull up the trusty calculator and we do some calculations with it, we're going to say that the endpoint 58400 minus 55 is going to be that 3400 distance between how big the bucket is. And then we have nine buckets. So we've got one, two, three, four, five, six, seven, eight, nine buckets. Now, if I change either of these, if I change the bin width, then the buckets will typically change automatically within Excel, because if I decrease the bin width, you would expect that we would need more buckets in order to clear the entire data set to populate the entire data set. However, we could lower the number of buckets by also using the overflow bin and the underflow bin. In other words, if I have the outliers at the end, at the tail ends here, we could try to trim off the outliers at the tail end. So let's see how these can be implemented. Here's another pictorial representation. What happened here? We changed the data, the bin, from automatic to bin width of 7000. So now if I look at the bin width, we have 62000 minus 55000, 7000 different. So we have a big bin, bigger bins here, big, big bin. And so it also automatically changed the number of bins from, what do we have up here, nine to five bins. Now you can notice that if you just look at the look and feel of these two graphs, they give you a pretty different look and feel. So you've got this one is still orientated towards the middle here, but you've got these outliers towards the end. And if I do it this way, it's really emphasizing this middle point because of the bin sizes. So if I go back down again and we say let's look at another one, and let's say that we then make the bucket sizes very small. So in this case, we took the bin width only to 200. So now it's way down to 200. So you've got 55000 to 55200, only a $200 difference between the bins. Well, that means you're going to need a whole lot of bins to cover the data from 55000 up to 84000. I have 145 bins in here. And it also means that when I look at the height of the graph, it's only going up to 3.5, whereas the height of this graph is going up to like 40. And the height of this graph is between, it's like 30 to 40. So notice the other thing to kind of keep in mind here is that is this y-axis. So because this y, see it might look like this bar is as tall as like this bar, but it's not because they scaled the y-axis in order to fit this set of data. So that's another thing that can be a little bit deceptive when we try to change the data. Now this one, we went a little bit extreme on, very extreme, on the number of bins to emphasize the point that more detail isn't always good, because of course I can put this all the way down to one, right? If I had been with of one, Excel might not let me do that, but you can do that in theory. And that would be like entering just a normal bar chart, right? We just basically had all of the numbers down here, and you would have very few of the numbers where more than one happened to hit exactly on that particular number. And that means the picture wouldn't be very useful to us in that case. So this is going to be most likely to spread out, and there's not a, you know, three in one bucket isn't, you know, the buckets don't seem large enough to really give us an idea of the data, although they still give us, you know, kind of a spread in this case. So then let's take a look at the next one, and this one, what we did was change the height. So just to emphasize the other thing that people could manipulate is you could change the height of the histogram. And so what that does then, now I've increased the Y up to 100. So even though this is still going up to like 30, you know, it's 25 to 30, it looks like a more squat histogram, because if I compare it to this one up here, where this one was going up to a similar region, this one looks a lot bigger because the Y went up to 30 as the maximum number, which makes sense because Excel is going to try to zoom in as much as possible on the data that is relevant. And in here, we have a whole space on the graph that isn't relevant. So Excel would normally kind of oftentimes trim that out by default. But if someone wanted to de-emphasize the height of this graph, you know, they might try to add, you know, you could try to, someone might try to add a little bit more on the Y axis and say, yeah, that's a pretty squat. It's still pretty, you know, it's a much more squat curve. If you look at the curve here versus this curve, like, look at that. It's a huge slope versus this one. You only got this little squat thing. Well, no, it's like the same. It's just that you changed the representation on the Y. And so this one, let's take a look at this one, where we did the starting point at 10 instead of zero. So if you started at 10 on the bottom, that's kind of weird for a histogram oftentimes because oftentimes you're going to start it at zero. But if you started at 10, then it kind of trims out all of the low points. See up here, we started at zero and then you had these ones over here. If you started at 10, this, this, this and this are all gone. And you're only starting at 10. Now, maybe that might be relevant sometimes. Like you might say, everything under 10 is irrelevant. So I'm just going to start it at 10, but that's kind of unusual. And clearly that's emphasizing the middle range and all of the lower numbers have been cut out. So that would be, you know, kind of a misrepresentation of this data you would think would be a little weird. But and then this one and this one, what did we do on this one? Oh, I see we see this one is where we did the overflow bins. So now this one up top, you'll remember the normal one had these these outliers on the end. Now, you might do this for legitimate reasons because you might say, hey, look, this the outliers often, of course, are the problem with the data set because when you look at an average, the outliers are thrown off the average. If you compare the average salary of people that live in your neighborhood and you happen to have a millionaire that one mansion that sits like in the neighborhood and everyone else makes, you know, normal salary, then then if you count that outlier, it's going to it's going to pull up the average a lot because the outlier is quite extreme. So the outliers are often an issue. So if you if you make the histogram, like what if these were 0000 bucket and then you had like a one way out here? Well, then it's going to really skew the look of your of your graph because you have to include that one way out here. And so to trim that out, you can use these you can use these this function. So now you can say, okay, the bandwidth is 2009 Ben same thing we had before. But now in the overflow bin, we're going to say everything that's above 77000 and everything under 65000 we trim those out. So you can see now we have a couple in this one because those two that were here, we kind of combined them together in that overflow bin. And that gives you a more. This is kind of the nicest looking one, you know, possibly of all of them because it kind of shows the data, you know, nice and trim in the middle. But it de emphasizes the outliers. We're not getting a good sense of what those outliers actually are because we've kind of pulled them in to everything over 77000. So if you have an idea of, well, I want to know how big that outlier actually was or whatever, then this is kind of removing that data. But again, you can see why sometimes that would be useful to do to get a pictorial representation that's kind of focused on the heart of the data. So clearly, when you look at different data that might have a different, you know, histogram to it, then as you adjust the bucket size, the main things that you could adjust the bucket size, the number of bins and the outliers, and then also you could play with the Y axis will have different kind of in fact packs on different data sets. So of course we have to be aware of which is why it's nice to look at the data from multiple different angles. It would be great to have multiple angles. So if someone is having a deceptive representation of the data, then you can call it out and say like, you know, that looks a little deceptive. Not trying to call you a liar or anything, but it kind of seems like you kind of seems like you're you did some funny stuff to the histogram. I'm not sure if we looked at it from a different angle, we'd get the same sense.