Statistics and Excel. Standard deviation, measuring spread. Got data? Let's get stuck into it with statistics and Excel.

Introduction. Our overarching objective, kind of like our mission statement, is confronting the challenge of taking a list of numbers and structuring them in a way that offers meaning. So if we have a long list of numbers, we want to summarize those numbers so that we can extract some meaning from the data, using two primary types of tools: numerical summaries of the data and pictorial summaries of the data. The numerical summaries include our common statistical calculations, such as the mean or average, quartile one, the median (which is quartile two), quartile three, and so on. The pictorial representations include things like our box and whiskers, or box plot, as well as histograms.

In this section, we're focused more on measures of dispersion, or spread, of the data, building on our measures of center. In other words, many of our standard statistical calculations focus on measures of center, the two most common being the average or mean and the median. We now want to focus on the spread of the data around that center point, and to do so not just with a visual representation, such as a histogram or box and whiskers, but with a numerical representation, such as the variance and standard deviation.

This lecture focuses on methods and principles for dealing with complete population data. Note that when we look at calculations of spread, the most common being the variance and standard deviation, there are slight differences between working with the entire population and working with a sample. Here we're going to start by talking about an entire population; in future presentations we will get into the sample.

So, measuring central tendencies. We got into this a little bit in prior sections.
So we'll recap those central tendencies and then move on to the measures of spread. First we've got the mean, or the average, the most common and most famous type of calculation. Most of the time, when people are trying to summarize data with one number, they're looking for the mean or the average. The definition: the sum of the data divided by the number of data items. So we sum up all of the data and divide by the number of items. It's often denoted by x-bar, the x with the bar on top, or by mu, the Greek letter that looks like a u. The mean has a physical interpretation as the balance point of the data. So if we were to look at a histogram, put a fulcrum under it, and balance the data on it like a teeter-totter, the mean is that balancing point. The mean is affected significantly by outliers. So if we had outliers, as we saw with our salary data set when we added the CEO's salary, which was millions and millions of dollars, that will have an impact on the mean.

Whereas the median, by definition, is the middle number in an ordered list. So if we list our numbers from smallest to largest and pick the one in the middle, we've got the median. The median is resilient against the effects of outliers. That's one of the benefits of the median: a big outlier is not going to have a big impact on the median, as it oftentimes will with the mean or average.

All right, let's move on to dispersion and the five-number summary. We took a look at the five-number summary, which is related to the box and whiskers, or box plots, in prior sections. It does give us an idea of the spread. In other words, a simplistic approach to understanding spread is the five-number summary, where we just take the data and break it out, right? We'll take the smallest point of the data. We'll take the first quartile of the data.
We'll take the median of the data, the middle number, then the third quartile and the maximum. So this is a similar concept to simply taking the median, the middle number in the data set, but breaking it out a little more than just taking the middle number, right? We'll take the smallest number. We'll take the first quartile, the first 25%, then the median, which is the second quartile, then the third quartile and the maximum. If you have just those five numbers, you do get a rough picture of the spread: if you look at a histogram, for example, and imagine those five numbers on it, you could say, okay, I kind of get an idea of the spread of the data.

However, there are limitations to how much of a concept of spread you're getting with this five-number summary. So we want to add another numerical representation to it, which is ultimately going to be the variance and the standard deviation. The five-number summary does not give a refined sense of where all the data lies. It's a good summary, a good tool, but it's rough, and we probably want more tools to get into the spread of the data.

The histogram does offer visual insights about data distribution and dispersion. As we've seen in prior sections in some detail, when we look at histograms we get a good visual representation. I can look at the five-number summary and the histogram together and, if I also calculate the mean, the fulcrum point would be the mean, and then I can get an idea of where these five numbers lie in our histogram. So the histogram is a great tool for getting an intuitive sense of the spread of the data. However, we'd also like more tools to measure the spread numerically. And that's ultimately going to be, once again, the standard deviation and variance. But before we get there, let's first try to think this out in a more intuitive way, which we will do in practice problems as well.
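
As a quick illustration of the recap above, here is a minimal sketch of the mean, the median, and the five-number summary, and of how one large outlier affects each. It uses Python's standard `statistics` module rather than Excel, purely for illustration. The salary figures and the helper name `five_number_summary` are made up for this example and are not from the lecture's data set. The `method="inclusive"` option is an assumption chosen because we are treating the list as a complete population; other quartile conventions can give slightly different Q1 and Q3 values.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    # quantiles(n=4) returns the three cut points: Q1, Q2 (the median), Q3.
    # method="inclusive" is the option the Python docs suggest when the
    # data are the whole population rather than a sample.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))

# Hypothetical salaries in thousands of dollars (illustrative only).
salaries = [40, 45, 50, 55, 60, 65, 70, 75, 80]
# The same list with one large outlier added, standing in for a CEO salary.
with_ceo = salaries + [5000]

for label, data in [("without CEO", salaries), ("with CEO", with_ceo)]:
    print(label)
    print("  mean:", statistics.mean(data))
    print("  median:", statistics.median(data))
    print("  five-number summary:", five_number_summary(data))
```

With these made-up numbers, adding the outlier moves the mean from 60 to 554, while the median only moves from 60 to 62.5. The five-number summary changes mostly in its maximum, which shows both why it is a useful rough picture and why it says little about how the rest of the data is spread around the center.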
So you might say, hey, look, if I want more numbers than this five-number summary to get an idea of the spread of the data, what might I do? From an intuitive perspective, we might do something like an average deviation, which is a stepping stone to what is used most often in practice, the standard deviation and the variance. So remember, the average point, if I look at my histogram, is the focal point at which there's an