Statistics and Excel, standard deviation, measuring spread. Got data? Let's get stuck into it, with statistics and Excel. Introduction, overarching objective, kind of like our mission statement. Confronting the challenge of taking a list of numbers and structuring them in a way that offers meaning. So if we have a long list of numbers, we want to be able to summarize those numbers in a way that we can extract some meaning from the data using two primary types of tools. Numerical summaries of the data as well as pictorial summaries of the data. 
Numerical summaries of the data include our common statistical calculations, such as the mean or average, quartile one, the median (which is quartile two), quartile three, and so on. Pictorial representations of the data include things like our box and whiskers or box plot, as well as histograms. In this section, we're focused more on measures of dispersion or spread of the data, building on our measures of center. In other words, many of our standard statistical calculations focus in on measures of center, the two most common being the average or mean and the median. We now want to focus more on the spread of the data around that center point, doing so not just with a visual representation such as a histogram or box and whiskers, but with a numerical representation such as the variance and standard deviation. This lecture focuses on methods and principles for dealing with complete population data. Note that when we look at calculations of spread, the most common being the variance and standard deviation, there are slight differences when you're talking about the entire population as opposed to a sample. Here we're going to start by talking about an entire population. In future presentations, we will get into the sample. So, measuring central tendencies. We got into this a little bit in prior sections, so we'll recap those central tendencies and then move on to the measures of spread. We've got the mean or the average, the most common, the most famous type of calculation. Most of the time when people are trying to summarize data with one number, they're looking for the mean or the average. Definition: the sum of the data divided by the number of data items. It's often denoted by x-bar, the x with the bar on top, or by mu, the Greek letter that looks like a u (conventionally x-bar for a sample mean and mu for a population mean). It has a physical interpretation as the balance point of the data. 
So if we were to look at a histogram and put a fulcrum under it and balance the data on it like a teeter-totter, the mean is that balancing point. The mean is affected significantly by outliers. So if we had outliers, such as we saw with our salary data set, where we added the CEO salary, which was in the millions of dollars, that will have an impact on the mean. Then we have the median. Definition: the middle number in an ordered list. So if we list our numbers from smallest to largest and pick the one in the middle, we've got the median. It's resilient against the effects of outliers. That's one of the benefits of the median: a big outlier is not going to have a big impact on the median, as it oftentimes will with the mean or average. All right, let's move on to dispersion and the five number summary. We took a look at the five number summary, which is related to the box and whiskers or box plot, in prior sections. It does give us an idea of the spread. In other words, a simplistic approach to understanding spread is the five number summary, where we take the data and break it out: the minimum, the first quartile, the median, the third quartile, and the maximum. So this is a similar concept to taking the median, the middle number in the data set, but breaking it out a little bit more than just taking the middle number. We take the smallest number, then the first quartile, which marks the first 25%, then the median, which is the second quartile, then the third quartile, and then the maximum. If you have just those five numbers, you do get a rough picture of the spread. If you look at a histogram, for example, and imagine those five numbers on it, you could say, okay, I kind of get an idea of the spread of the data. 
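As a rough illustration of the recap above, here is a minimal Python sketch of the mean, the median, and the five number summary. The salary figures are made up for the example and are not the lecture's data set. The quartile convention (`method="inclusive"`) is also an assumption: spreadsheets and textbooks use several quartile conventions, so a given Excel function may return slightly different quartiles.

```python
import statistics

# Hypothetical salaries in thousands of dollars (not the lecture's data set).
salaries = [30, 35, 40, 45, 50]


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the 'inclusive' quartile convention."""
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return (min(data), q1, q2, q3, max(data))


mean_before = statistics.mean(salaries)
median_before = statistics.median(salaries)

# Add one extreme value, such as a CEO salary, and recompute both measures.
with_ceo = salaries + [1_000_000]
mean_after = statistics.mean(with_ceo)
median_after = statistics.median(with_ceo)

print("mean:", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
print("five number summary:", five_number_summary(salaries))
```

With these made-up numbers the single outlier drags the mean far away from the bulk of the data while the median barely moves, which is the resilience described above.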
However, there are limitations to how much of a concept of spread you're getting with this five number summary. So we want to add another numerical representation to it, which is ultimately going to be the variance and the standard deviation. The five number summary does not give a refined sense of where all the data lie. It's a good summary, a good tool, but a rough one, and we probably want more tools to get into the spread of the data. The histogram does offer visual insights about data distribution and dispersion. In other words, as we've seen in prior sections in some detail, when we look at a histogram, we get a good visual representation. I can look at the five number summary and the histogram together, and if I also calculated the mean, the fulcrum point would be the mean, and then I can get an idea of where these five numbers lie in the histogram. So the histogram is a great tool to get an intuitive sense of the spread of the data. However, we'd also like more tools to describe the spread numerically, and that's ultimately going to be, once again, the standard deviation and variance. But before we get there, let's first try to think this out in a more intuitive way, which we will do in practice problems as well. So you might say, hey, look, if I want more numbers than this five number summary to get an idea of the spread of the data, what might I do from an intuitive perspective? We might do something like an average deviation, which is a stepping stone to what is used most often in practice, the standard deviation and the variance. So remember that the average, if I look at my histogram, is the balance point, if you think of this as like a teeter-totter. 
So if I want to get my spread, I might say, hey, look, why don't I take each of the data points, represented by x sub i, with i going from 1 to n across the whole data set, and subtract mu, which represents the mean or the average. If I take each data point minus the fulcrum, the average represented by mu, I get the signed distance of each data point from that middle point. Now, if I add up all of those differences, they're going to add up to zero, because some of the points are higher than the mean and some are lower, and the property of the average is that the positive and negative differences cancel out exactly. So you might then think the next thing to do would be to take the absolute value. That means we're taking the distance from each data point to the average, but I don't care if it's higher or lower than the average. I'm not using positives and negatives. I'm just looking at the distance, whether the difference goes to the right or to the left. Then I'm going to add up those distances and divide by the number of data points, and that would be the most intuitive thing that we might come to if we mull it over. So one intuitive way to measure the spread of data is to look at how far each datum is from the mean: take each datum minus the mean, represented by mu, take the absolute value of that difference, and then take the average of those values by summing them and dividing by n. This average distance from the mean is a potentially useful measure of dispersion, but it's not the most commonly used measure. So although it leads into the most commonly used measures, the variance and standard deviation, you could use it, but it's not the one you'll probably be working with most of the time. Next, delving deeper into dispersion. 
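The average deviation steps described above can be sketched in a few lines of Python. The data set is hypothetical, chosen only so the arithmetic comes out evenly, and the variable names are illustrative.

```python
# Hypothetical data set, chosen so the arithmetic comes out evenly.
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
mu = sum(data) / n  # the mean, the balance point of the data

# Signed deviations from the mean: positives and negatives cancel out.
deviations = [x - mu for x in data]
sum_of_deviations = sum(deviations)

# Absolute values drop the sign, leaving only distances from the mean.
average_deviation = sum(abs(d) for d in deviations) / n

print("mean:", mu)
print("sum of signed deviations:", sum_of_deviations)
print("average deviation:", average_deviation)
```

With other data the sum of signed deviations may print as a tiny nonzero number because of floating point rounding, rather than exactly zero. In Excel, the `AVEDEV` function computes this same average of absolute deviations from the mean.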
So now we have the variance and standard deviation. Definition: they quantify how spread out the numbers are from the mean. So now we're moving from the average deviation formula to the variance and standard deviation. You can see the similarities, and we'll go into them in more detail: the average deviation takes the absolute value of each difference, whereas the variance squares each difference, and the standard deviation is just the square root of the variance. So these two are closely related. The variance is kind of a stepping stone to get to the standard deviation, which is why the variance is often represented by sigma squared and the standard deviation simply by sigma. So the variance, let's go into it step by step. It's denoted by s squared or sigma squared, sigma being the Greek letter (s squared is the usual notation for a sample, sigma squared for a population), and it's the average of the squared differences from the mean. Similar to what we had with the average deviation, we're going to take each of the points and subtract the mean from it. Same thing we did before, which gives us the signed distance of each point from the mean, but instead of taking the absolute value, we're going to square it. Squaring has the same property of removing the negative numbers, which we need to do so that the differences don't cancel out when we average them. However, squaring also changes the scale, so large differences become much larger numbers and the result is in squared units. So we square each difference, making everything positive, add them up, and divide by n, and that gives us the variance. Now the variance is kind of an abstract number because of that squaring, but especially when we're comparing different data sets, like salaries in the U.S. versus salaries somewhere else in the world, it can oftentimes be a telling factor for comparative purposes. 
Looked at in and of itself, though, the variance might seem like a number that's not giving you a lot of value. The next step would be the standard deviation. Now you're simply taking what you had for the variance and taking the square root of it, transforming the variance, which is represented by sigma squared, to just sigma, the standard deviation. So it's the same exact thing, except now we're taking the square root. It's kind of like we squared the differences and then undid the squaring by taking the square root, which puts the result back in the original units of the data. You're not going to get the same number as we got with the average deviation, but you can see a similar kind of process: with the average deviation, we took the absolute value to deal with that negative number problem, and here we squared and then took the square root. All right. We'll talk more in a second about why we might use this, which looks more complex than the average deviation. So, standard deviation: the square root of the variance. It gives roughly the average distance of the data points from the mean. The values will be larger if the data set is more widely spread and smaller if the data are close to each other. Again, both of these numbers often seem a little bit abstract, but if you're comparing different data sets, the meaning becomes apparent: if the standard deviation is larger, you would expect more spread in the data from the middle point, mu, the mean. If it's smaller, you would expect the data points to be more compact around that middle point. Both measures are affected by outliers. Notice that we're comparing each point to the mean. If the mean is impacted by outliers, you would expect the standard deviation and variance to be impacted by outliers as well, and the squaring gives large differences extra weight. 
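The population variance and standard deviation described above can be sketched as follows. This is a minimal example on a hypothetical data set that we imagine to be the entire population. It computes both measures by hand and then compares them with Python's `statistics` module, where `pvariance` and `pstdev` are the population versions (dividing by n) and `stdev` is the sample version (dividing by n minus one).

```python
import math
import statistics

# Hypothetical data, treated here as the entire population.
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
mu = sum(data) / n

# Variance: the average of the squared differences from the mean.
variance = sum((x - mu) ** 2 for x in data) / n

# Standard deviation: the square root of the variance.
std_dev = math.sqrt(variance)

# Library equivalents for a population (denominator n).
lib_variance = statistics.pvariance(data)
lib_std_dev = statistics.pstdev(data)

# Sample version (denominator n - 1), shown only for comparison.
sample_std_dev = statistics.stdev(data)

print("population variance:", variance, lib_variance)
print("population standard deviation:", std_dev, lib_std_dev)
print("sample standard deviation:", sample_std_dev)
```

In Excel, the corresponding population functions are `VAR.P` and `STDEV.P`, while `VAR.S` and `STDEV.S` use the sample denominator.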
So we have to take that into consideration when we're dealing with outliers. Basically, the standard deviation is the square root of the average squared distance from the data points to the mean. Note that for samples, n minus one is used as the denominator to account for degrees of freedom. We're dealing with a population here. You might see a similar formula for the standard deviation, but with n minus one in the denominator. That's the difference between taking the standard deviation for the entire population, where we have all the data, and taking it for a sample, which is only part of the data in the population. We'll talk more about that in future presentations. Right now, in this section, we're generally focused on data that we are imagining to be the entire population. All right, let's get back to this question of why we square the differences. Why don't we just use our average deviation? My problem up top is that when I take each data point minus the mean, I get negative numbers, and I need to get rid of the negative numbers so I can sum up the differences from the mean. So why not just take the absolute value instead of squaring and then, in essence, taking a square root? One reason is that the population mean is the unique value that minimizes the sum of the squared differences. In other words, squaring has a characteristic that ties it to the mean: measure the squared differences from any other point and the total gets larger. We'll show this in one of our example problems. When asked why we square the data, most people will tell you that you do it because it gets rid of the negative numbers. But then the question, of course, is, well, why don't you just take the absolute value, because that also gets rid of the negative numbers and is easier, because you don't have to square and then take the square root. 
And in mathematics, normally we want things to be as easy as possible, removing any excess steps so that we get down to the simplest formula that we can apply to a particular situation. So you would think there's got to be a reason why we would do something that's more complex. Basically, if we were to take some focal point other than the average, just some other number used as the middle point, and compare each data point to it, we could end up with the same number for the average deviation as we got using the average. Whereas with the squared method behind the standard deviation, using the average in this slot gives a smaller result than any other number I could pick, so the average is uniquely singled out. That might not be completely necessary to understand in order to do the calculations, but the question often comes up, so it's useful to get an intuitive understanding. We'll work a practice problem related to that. So, implications and applications: comparing dispersion in different contexts. Notice that once you have these numbers, we have to relate the data sets to actual reality in order to draw meaning from them. For example, if we're dealing with salaries at large corporations in different countries, and we had the data sets for these different countries, and we were calculating some of our statistical measures, such as the central points, median and mean, as well as the dispersion, standard deviation and variance, that might tell us something about the different strategies for incentives and compensation in the different countries. We might be able to draw conclusions from that data. But of course, we need to know the context of the data sets in order to draw the conclusion. We need to know that they're data sets about salary in one country versus another country that might have different strategies around compensation. Then, when we get the data, we can possibly draw conclusions of that nature. 
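Returning to the question of why we square the differences, here is a small sketch of the uniqueness point. The data set and the alternative focal point are hypothetical, picked so that the average absolute deviation comes out the same around two different centers, while the average squared deviation is smallest only at the mean. The helper names are illustrative.

```python
# Hypothetical data set; its mean is 4.
data = [1, 2, 3, 10]
mu = sum(data) / len(data)


def average_abs_deviation(values, center):
    """Average absolute distance of the values from a chosen center."""
    return sum(abs(x - center) for x in values) / len(values)


def average_sq_deviation(values, center):
    """Average squared distance of the values from a chosen center."""
    return sum((x - center) ** 2 for x in values) / len(values)


other_center = 1  # some focal point other than the mean

abs_about_mean = average_abs_deviation(data, mu)
abs_about_other = average_abs_deviation(data, other_center)
sq_about_mean = average_sq_deviation(data, mu)
sq_about_other = average_sq_deviation(data, other_center)

# Scan candidate centers from 0.0 to 10.0 and find the one with the
# smallest average squared deviation.
candidates = [c / 10 for c in range(0, 101)]
best_center = min(candidates, key=lambda c: average_sq_deviation(data, c))

print("average absolute deviation:", abs_about_mean, "vs", abs_about_other)
print("average squared deviation:", sq_about_mean, "vs", sq_about_other)
print("center with the smallest squared deviation:", best_center)
```

The two absolute-value results tie, so that measure does not single out the mean, whereas the scan over candidate centers lands on the mean for the squared version. As a side note, the average absolute deviation is actually smallest around the median rather than the mean.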
So, inferring meaning. While statistics provides valuable tools, it's the application and the understanding of the context that brings deeper meaning. Data must be interpreted within its context. Now notice also, when we deal with data within a context, we're inevitably going to be dealing with some kind of politics around it as well, whether that be corporate politics or government politics. Everybody's got their biases, and that often leads people to the old quote about lies and statistics, as if the statistics are at fault when the data are misleading. So we have to be able to represent the context properly, because again, it's not the statistics' fault. The statistics are just the numbers. If the context around the statistics is being misrepresented, then we have to get down to that misrepresentation of the context, just like we would if people misrepresented something in words. It's not the words that are the problem if people are using words improperly, misdefining words, making up new words, or saying words that mean one thing and acting like they mean another. The words aren't at fault. It's the people who are lying with the words. The same thing is true of the statistics. They're just a tool, so we have to keep that in mind. So, summary. The mean and the median, although useful, don't tell us anything about how widely spread the data are. Most of the time those central tendency numbers, the median and the mean, are the first ones we look at, but it's also going to be quite useful to know the spread of the data, which we can visualize with a histogram but would also like to represent numerically. A histogram gives a good visual sense of the distribution but not a summarized numerical one. So the histogram is great, but we'd also like to have a numerical representation. 
The five number summary and the associated box plot give some sense of how the data are spread out but can sometimes be misleading. So you might say, hey, the five number summary gives me a nice picture of the spread of the data to some degree, but we'll work an example to show where it sometimes falls short, where two very different data sets result in the same five number summary and the same box and whiskers plot. The standard deviation is a numerical measure of roughly how far the data are, on average, from the mean. When we look at the standard deviation, remember that's the idea of it. You've got the middle point, the mean, the balance point if you're looking at the histogram, and with the standard deviation calculation you're trying to think about the typical distance from that point. Now, remember that the standard deviation and the variance can be more abstract terms. When we think about the mean or the median, and even the five number summary, the number in and of itself is usually enough for us to grasp what it's telling us about the data. When we get into the standard deviation and variance, they can be a little bit more abstract. So working through practice problems with different data sets, and getting an idea of the context, is often useful. The variance especially can look like a very abstract number, but it can be a useful one when we're comparing different data sets. We'll work some practice problems in this section and continue with these concepts in future sections.
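To make the point about the five number summary concrete, here is a hedged sketch with two made-up data sets (not the ones used in the practice problems). They share the same five number summary under the assumed "inclusive" quartile convention, yet their population standard deviations differ.

```python
import statistics


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the 'inclusive' quartile convention."""
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return (min(data), q1, q2, q3, max(data))


# Two hypothetical data sets: the first piles values up toward the
# extremes, the second piles them up near the middle.
set_a = [0, 0, 25, 25, 50, 75, 75, 100, 100]
set_b = [0, 25, 25, 50, 50, 50, 75, 75, 100]

summary_a = five_number_summary(set_a)
summary_b = five_number_summary(set_b)

# Population standard deviation of each set.
std_dev_a = statistics.pstdev(set_a)
std_dev_b = statistics.pstdev(set_b)

print("five number summaries:", summary_a, summary_b)
print("standard deviations:", round(std_dev_a, 2), round(std_dev_b, 2))
```

Both sets would produce the same box plot, but the standard deviation picks up that the first set sits farther from the mean on average.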