Statistics and Excel. Histograms with different bucket sizes. Got data? Let's get stuck into it with statistics and Excel.

You're not required to, but if you have access to OneNote, we're in the icon on the left-hand side: OneNote, Excel presentations tab, 1040 Histogram with Different Bucket Sizes. We're also attempting to upload our transcripts to OneNote. You can use the Immersive Reader tool to change the language of the transcripts and then either read through or listen to the transcripts in multiple languages, using the timestamps to tie in to the video presentations.

Desktop version of OneNote here. The data on the left-hand side we are imagining to be the annual salary or annual income of employees at, say, a corporation. It's the same data set we've been looking at in prior presentations. Instead of having that data ordered by, say, alphabetical order of employees, the first thing we typically do is sort it, either from lowest to highest or highest to lowest. That gives us a general sense of the data. Then we've been talking about pictorial representations, the major two being the box and whiskers and the histogram.

We're now focused on the histogram, which gives us that nice spread of the data. We looked at a histogram in a prior presentation. If you create a histogram in Excel, which is a great tool for putting together these pictorial images, Excel will usually give you a pretty good result, but it has to make some assumptions to create the histogram. One of the main things it has to decide is how many boxes we want down below and what the spread between the boxes is going to be. In other words, we have our data on the left-hand side, and what we want to do is put that data into buckets so that we can see how many items fall into each of those buckets. So then, of course, the question is: how spread out should the buckets be?
What kind of buckets should we be making? If we just highlight the data and insert a histogram, Excel will make some of those assumptions for us. So here, for example, the first bucket is 55,000 to 58,400, and we only have one of the data points in it. Nothing is in the buckets from 58,400 all the way up to 65,200, and then when we go to 65,200 to 68,600, somewhere around here, we've got five data points, and so on and so forth.

Now, if we make some changes to the histogram settings, it can make a big change in what the histogram looks like, and that could change our perception of what the histogram is telling us. From a positive perspective, that means we can alter some of the items on the histogram to get a better picture of what it is we're trying to narrow in on. On the negative side, and this is often what happens in practice, people use histograms to support an opinion that they already have in place. If someone wants to put some kind of policy in place in the corporation, for example, they might manipulate the look of a histogram, or whatever pictorial representation they are using, to better support their argument.

So what we want to do is ask two things. First, how can we use the histogram when we're really trying to understand the data, so we can zero in on a picture that is most representative from multiple angles? Second, how might someone try to be deceptive by manipulating the histogram if they're arguing for a particular point? To answer that, we have to put ourselves in the mindset of someone who is trying to be deceptive with histograms, so that we can guard against deception in any pictorial representation of data. All right.
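To make the bucketing idea concrete, here is a minimal Python sketch of counting values into fixed-width buckets. The salary figures are hypothetical, invented so that the counts line up with the buckets read off the chart above (one value in the first bucket, two empty buckets, then five values in 65,200 to 68,600); they are not the presentation's data. The edge rule in `bucket_counts` (first bucket includes both edges, later buckets include only their upper edge) is an assumption for illustration, not a statement of Excel's internals.

```python
import math

def bucket_counts(values, bin_width):
    """Count how many values fall into each fixed-width bucket.

    Buckets start at the minimum value. The first bucket includes both of
    its edges; each later bucket excludes its lower edge and includes its
    upper edge (an assumed convention, chosen for illustration).
    """
    lowest = min(values)
    highest = max(values)
    n_bins = max(1, math.ceil((highest - lowest) / bin_width))
    counts = [0] * n_bins
    for v in values:
        # A value sitting exactly on an upper edge stays in the lower bucket.
        index = max(0, math.ceil((v - lowest) / bin_width) - 1)
        counts[index] += 1
    edges = [lowest + i * bin_width for i in range(n_bins + 1)]
    return edges, counts

# Hypothetical salaries, invented for illustration only.
salaries = [55000, 65500, 66000, 67200, 68000, 68600, 70100, 72500, 84000]

edges, counts = bucket_counts(salaries, 3400)
for lo, hi, c in zip(edges, edges[1:], counts):
    print(f"{lo:>6} to {hi:>6}: {c}")
```

Changing the second argument of `bucket_counts` is the code equivalent of changing the bucket size on the chart: the same values get regrouped, and the shape of the counts changes with them.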
So by default, on the left-hand side, this is the information related to the x-axis, the buckets. The two major settings are going to be the bin width and the number of bins. Currently we have it on automatic, which means these two are grayed out, and Excel just put in 3,400 for the bin width. In other words, that's the difference between these two numbers. If I pull up the trusty calculator, the endpoint 58,400 minus 55,000 gives that 3,400 distance, which is how big the bucket is. And then we have nine buckets: one, two, three, four, five, six, seven, eight, nine.

Now, if I change either of these, say the bin width, then the number of buckets will typically change automatically within Excel, because if I decrease the bin width, you would expect that we would need more buckets in order to cover the entire data set. However, we could lower the number of buckets by also using the overflow bin and the underflow bin. In other words, if I have outliers at the tail ends here, we could try to trim off the outliers at the tail ends. So let's see how these can be implemented.

Here's another pictorial representation. What happened here? We changed the bin setting from automatic to a bin width of 7,000. So now if I look at the bin width, we have 62,000 minus 55,000, a 7,000 difference. So we have bigger bins here, and it also automatically changed the number of bins from the nine we had up here to five. Now, if you just look at these two graphs, you can see they give you a pretty different look and feel.
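The link between bin width (the bucket size) and the number of bins can be sketched in a few lines of Python. This assumes the data runs from 55,000 up to 84,000, the range quoted elsewhere in this presentation, and that the bin count is simply the range divided by the width, rounded up. The helper names `number_of_bins` and `counts_with_tails`, the cutoffs 65,200 and 72,000, and the salary list are all hypothetical. The tail rule used here (underflow collects values at or below its cutoff, overflow collects values above its cutoff) is an assumption for illustration.

```python
import math

def number_of_bins(lowest, highest, bin_width):
    """Bins needed to cover lowest..highest at the given width."""
    return max(1, math.ceil((highest - lowest) / bin_width))

def counts_with_tails(values, bin_width, underflow, overflow):
    """Counts as [underflow bin, regular bins..., overflow bin]."""
    below = sum(1 for v in values if v <= underflow)
    above = sum(1 for v in values if v > overflow)
    middle = [0] * number_of_bins(underflow, overflow, bin_width)
    for v in values:
        if underflow < v <= overflow:
            # A value exactly on an upper edge stays in the lower bin.
            middle[math.ceil((v - underflow) / bin_width) - 1] += 1
    return [below] + middle + [above]

# Range taken from the presentation: 55,000 up to 84,000.
for width in (3400, 7000, 200):
    print(width, number_of_bins(55000, 84000, width))

# Hypothetical salaries, invented for illustration only.
salaries = [55000, 65500, 66000, 67200, 68000, 68600, 70100, 72500, 84000]
trimmed = counts_with_tails(salaries, 3400, underflow=65200, overflow=72000)
print(trimmed)
```

Under these assumptions the three widths give 9, 5, and 145 bins, which agrees with the counts shown on the charts. The tail version folds the outliers at each end into a single bin, so the same nine values fit into four bins instead of nine.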
So this one is still oriented towards the middle here, but you've got these outliers towards the ends. And if I do it this way, it's really emphasizing this middle point because of the bin sizes. If I go back down again and look at another one, let's say that we then make the bucket sizes very small. In this case we took the bin width down to only 200. So you've got 55,000 to 55,200, only a $200 difference between the bin edges. Well, that means you're going to need a whole lot of bins to cover the data from 55,000 up to 84,000. I have 145 bins in here, and it also means that.