 Statistics and Excel. Wages, data, box plot or box and whiskers. Got data? Let's get stuck into it with statistics and Excel. First, a word from our sponsor. Yeah, actually we're sponsoring ourselves on this one because apparently the merchandisers, they don't want to be seen with us. But that's okay whatever because our merchandise is better than their stupid stuff anyways. Like our crunching numbers is my cardio product line. Now, I'm not saying that subscribing to this channel, crunching numbers with us, will make you thin, fit and healthy or anything. However, it does seem like it works for her, just saying. So, subscribe, hit the bell thing and buy some merchandise so you can make the world a better place by sharing your accounting instruction exercise routine. If you have access to this online one note, we're in the icon left hand side, the one note and Excel presentations, 10, 20 wage box plot tab. We will also try to upload our transcripts so that if you choose to, you could go to the view tab up top, use the immersive reader tool. You can change the language here if you so choose and either read or listen to it in multiple languages. One note desktop version here looking at our information, our data on the left hand side, which we are imagining to be salary income information, possibly related to employees of a corporation or a business or possibly related to faculty at a school if you want to imagine it that way. So note that usually when we first get the data, it's not going to be organized in a fashion that's going to be very useful to us. So for example, if we did a random sample, if we asked people what their salary is, if they were willing to give us what the actual salary was, then we might order the information by the people that we randomly asked what their salary was, or we might get the information if we have access to the salaries by alphabetical order. We might have a list of our employees in alphabetical order and the salaries related to them. Clearly, when we just look at the data set in that way, we might be able to extract some information from it, but it becomes quite difficult to do so because there's no ordering of the data. It's going to be difficult to get meaning from that data. And clearly, if you're in a situation with salaries, the common situation would be that you're trying to argue that you should have more salaries to have a larger income level. So you might look up data if you had access to the actual salary level in your particular organization, or you might look up salary related to people in your profession and say, hey, look, this is the comparative salary. This is what I'm making and so on and so forth. But clearly, if you were to present something in this order, it wouldn't be that useful, right? If you had a list of people's salaries and you presented in a meeting and tried to argue that you should get a raise because of this list of salaries, then this wouldn't be the most efficient way to present that list if it was in alphabetical order or something like that. What you would want to do typically is organize this and the first way to organize often will be from lowest to highest or highest to lowest. So in Excel, we will do this by basically making a table. We have a nice sort field so we can do this sorting very easily at one time. It was not so easy. It's great that we can easily sort this information now. So now we've got it from highest to lowest, which is probably what we're going to be looking for. We're going to be saying, hey, look, this person's making 84,000 and whatever and so on and so forth. And we can get a decent idea of the range by simply looking at it in order. Now, that's clear when you look at this kind of data, but if you are looking for a huge set of data that has a lot more information in it, then this would still be fairly difficult to look at and extract a lot of meaning from it. So the next thing to do is usually to make some kind of pictorial representation of the data. So the pictorial representations very important. They're not simply just a crutch type of tool to explain stuff to the layman. We want to have the pictorial type of data. Now the box and whiskers or box plot is a fairly easy kind of thing to put together. And I think one of the attractiveness of the box plot versus what we'll take a look at later, which will be a histogram, another way to sort the data, possibly one that you'll see more commonly because it gives you another sense or a different sense of the spread of the data. But I think the box plot was kind of easier to draw. You had to draw something by hand, you know, it might be a little easier to draw the box and whiskers than the histogram. But now, of course, we have Excel, so we can fairly easily draw whatever we want to draw. So we can make the box and whiskers if we so choose, and we can make the histograms fairly easily if we so choose and we can make bar charts, which we will practice in Excel. So if we were to plot this data, we could we could see what we have here. We've got the wages, and we have to come up with what we want to be the intervals of the wages. So it starts at 50, 55, 60, 65, 70, 75, 80, 85. And then we've got our information on the box and whiskers for the pictorial representation. So we actually put the data in here as well, which you can do in Excel. So now let's kind of analyze each of these points of the box plot. That's going to be our next kind of component. So I'm going to make this a little bit smaller here. So here's our same box plot on the right. And let's look at the meaning of some of these numbers in the box plot. So usually we want and we have the average or mean. That's one of the first types of calculations that we will typically do. So if I think about the average, how would we calculate the average of this set of numbers? Now, if you had to do this old school with a calculator, it would be quite tedious. But the concept is fairly straightforward. We're going to add up all of these numbers, boom, boom, boom, boom, boom. And then we're going to divide it by the total number of numbers that we have. We'll count the numbers and divide by the total. So if I was to do that, the manual calculation, I can add up all of the numbers and then divide it out. So I can think about that with two functions, the sum function, which would add up all the numbers and then divide represented by the slash here. And then this count function actually simply counts the numbers. So if you're interested in that in Excel, you can take a look at a problem in Excel. And then if I wanted to use an adjust a function, the function is the average function. So in Excel, I can just use the average function and then select this entire column of numbers. It'll give me the average. Remembering the average is the mean same thing towards same concept. So that's going to be how we can calculate the average that's represented over here with the X. Now note that the X is not exactly the same thing as the line. The X is the average, the line represents the median, the middle point of the number. So let's look at the next one. The minimum not included, not including outliers. So the minimum number not including the outliers, we've got 67.9. That's going to be this one right here, bottom of the box and whiskers, not including the outliers. The outlier being this 55,000. So we have to come up with some kind of rule and we're going to use kind of an arbitrary rule here to determine what should be basically an outlier that's going to go outside of the box and whiskers. Clearly outliers are going to be quite important because they can greatly distort numbers like the mean or the average. Now if I was going to look at this list of numbers and just figure out what that number is, 67,000, I can organize the numbers from lowest to highest or highest to lowest and there's the 67.9 right here, not including the outlier. See the outliers going over. You can also use the min function. Now the min function in Excel will choose the smallest number. So if I was to use the min function, I'd have to select all of this data, not including the outlier and then it would take the smallest number and that's another way that you can do it within Excel just to practice that min function. Alright then we have the first quartile excluding the median. So the first quartile excluding the median. So we went from the bottom of the whisker to the bottom of the box. So here's the first quartile at the 69,700. Before we further explain and dig down on the quartile one, it's easier to move down to the median or quartile two. Two names for the same thing, the median or quartile two where we are at the 70,900. That's represented on the box and whiskers by this line, which is different than the x. The x representing the average or mean, the line representing the median or quartile two. Notice that those two key terms, which are most often used, the average or the mean and the median or quartile two, both have those two kind of names to them. But most of the time people are going to use the median as the term. So what does that mean? It's kind of like if you've seen the Rocky movies where the advice to Rocky, the boxer, is you say, and I see three of them out there and the coach says you got to hit the one in the middle, right? That's what the median is, hit the one in the middle. So when we look at the average or mean last time with the x, that was when we added them all up and then we divided by the count, which in this case was 51. When we take a look at the median, we hit the one in the middle. So now this data is ordered from lowest to highest. Here's the count on the right hand side. If I look at simply the count at 51 and divide by two, we're at 25.5, so the middle number is that 7900, right? The 7900, which means above it, you've got 29 numbers above it from 1 to 25. And below it, you've got 51, the count here, minus 26. You've got 25 below it, right? So the one that's exactly in the middle is the 70,900. So remember, that's not always going to be the same. It often will not be the same as what the mean or average calculation will be. This is two different ways we're trying to find that center point. So if we did that with a formula, there's actually two formulas you can do in Excel. You can use the quartile formula, which would be equal quartile, and then you're going to be picking the data set here. So this is the array than the quartile, and it's quartile two to represent, two represents quartile two. Or more commonly, you would use simply the median function. So you just select the median function and then take this range of data, and Excel will pick the one in the middle for you. All right, so now that we know that, it's easier to go back into quartile one, which is this 69,700 again. So what does that mean? Well, we're breaking it out into quartiles. So last time we broke out the middle, that's the middle one. So what is quartile one going to be? It's going to be, I'm going to take everything that goes down to this line and take the one in the middle. Now you're going to ask, you might get, if you get technical on this, you're going to say, well, do I go from one to 26, and then take the one in the middle, including the median, or do I exclude the median? And the default in Excel is typically to exclude the median. So we're going to go from one to 25, excluding the median, and then take the one in the middle, right? So if I take the 25, if I take the 25 divided by two, we're at, you know, 12, 5, there's the 69, 7 in the middle. And that means that there's going to be 12 above it. So you've got numbers one to 12 above it. And then below it, we could just count them one, two, three, four, five, six, seven, eight, nine, 10, 11, 12, not including the 26, the median right here, because it is excluded. Now you could do the quartile count where you include the median, but the default is to exclude it. So you've got to kind of keep that in mind when you're doing your calculations. All right, so we're going to go back up now. We were now into the middle number. And then we of course have quartile three. So that's going to be the top of the box. So that's the 728, of course, and we can think about how to do that. If we had a manual calculation, then we'd have to say, okay, this is the middle point, the median quartile two, and this is the end point. So now we're taking the middle number between the middle here and the end for the next quartile. And so that would be then the Q3 excluding the median. And of course there's a function for it. And that would be the quartile function again in Excel, same as this function up top, choosing this number set. And then with a comma, the argument now being a three, because we're in quartile three. And then we've got the max point, which is the 24, the 24, which is the top of the whisker. And that does not include the outliers. So if I was to do that manually, I can then scroll down and I can sort my data. These two, we've declared to be outliers. So there's the 24, which is the top of the whisker. And if we were to do a formula for it, you can use a max function, which is a common function, you could use the max and then select this data. But you would have to select the data that doesn't include the outliers so that it picks as the max of the number not including the outliers. All right. So now we've explained kind of this whole box in whisker. And if you didn't have any outliers, that's all that you would have. But here we have these outliers. So then the next question is, well, what does it mean to be an outlier? Now, this description of an outlier is somewhat arbitrary. We're doing somewhat of a heuristic here to say, okay, what would make a number outside the general range? We're going to determine it as an outlier. Remember, outliers are really important because, like, for example, if you were trying to determine the average wealth of somebody in a particular group and you had one individual who happened to be a billionaire in it, then that billionaire is going to pull up the average of the entire group. It'll look like everybody's looks a lot more wealthy than they are because that outlier could possibly skew the data greatly depending on, you know, what the data set is looking at. So we have to be careful of the outliers. So enter quartile range. This is the IQR. This is going to help us to determine the outliers. We're going to just take the quartile 3 minus quartile 1. So this is going to help us to determine where they're coming up with these outliers. So if we take quartile 3 minus quartile 1, quartile 3 is the 7, 2, 800 minus quartile 1, 6, 9, 700. The difference is that 3,100, which we're calling the interquartile range. The lower quartile limit, then, is quartile 1 minus the IQR, the interquartile range, times 1.5, which is somewhat arbitrary, right? So that's going to be our calculation, though. So we're going to say, all right, that's what it's calculating. So now we're taking the Q1. So Q1 is, well, let's do it. Let's do the inter part first. The IQR, which is the 3,100 times 1.5 minus Q1 minus Q1, Q1 minus 6, 9, 7. And that gives us our 6550. So 65,050. So we're on the lower limit. So anything around that's kind of over this line, they're going to add in a dot format. So that's going to be the outlier. And then down here, we have a similar calculation for that upper limit where we have the quartile 3 plus the IQR, the interquartile range, times 1.5. So if I take the IQR, the 3,1 times 1.5, and then I'm going to add it this time to Q3. So plus Q3, 7 to 800, that gives us our 77,450, which means that if anything's above like the 77, somewhere around here, it's going to put a dot on them. So that's why these two are outlying. There's the 55 outlier, the 84, the 80 are going to be on the other side, 80 and 84 as outliers in our data. All right, and then just note that in Excel, it's possible to line up two box plots on one chart. So if we had two sets of data, similar sets of data, basically we've just added a constant amount to the second data range. And now just note that you can do a nice visual comparison between the two sets of data. I can say, okay, this one compared to that one, and we can get an idea of the averages and the medians and so on and what kind of outliers. These ones happen to mirror each other very closely but just staggered due to the fact that I created the second data list by just basically taking the first times like 1.1 or something. And then you can make your legend on the right-hand side. So notice when you get comparison of multiple data sets like this, then seeing it visually will probably give you a much better understanding than even if you were to list your stats. Like you might say, hey, look, to list those two data sets, I'll just list my stats side by side. That's helpful, but it's still a little bit hard to see, right? If you have these two kind of next to each other, then that gives you a pretty good representation about what's happening. I mean, if I looked at, for example, if I looked at the stats side by side for the average, the Q1, Q2, Q3 and so on and so forth, I might not get an idea that it looks like the data set has just been basically multiplied by like 1.1. But if I look at this, I can say, hey, that looks pretty symmetrical and it moved up like the same and the outliers moved up in alignment with it. So you see how the pictorial representation could give you an idea beyond the stats, oftentimes, or beyond the data set itself.