As a final step of statistics and exploring data, I'm going to talk about something that's not usually considered exploring, but is basic descriptive statistics. I like to think of it this way: you've got some data and you're trying to tell a story, or more specifically, you're trying to tell your data's story. With descriptive statistics, you can think of it as using a little data to stand in for a lot of data, using a few numbers to stand in for a large collection of numbers. And this is consistent with the advice that we get from good old Henry David Thoreau, who told us to simplify, simplify. If you can tell your story with fewer, more carefully chosen, and more informative numbers, go for it.

So there are a few different procedures for doing this. Number one, you want to describe the center of your distribution of data; if you're only going to give a single number, use that. Two, if you can give a second number, say something about the spread, or dispersion, or variability of the data. And three, it's also nice to be able to describe the shape of the distribution. Let me say more about each of these in turn.

First, let's talk about center; we have the center of our rings here. Now, there are a few very common measures of the center, or location, or central tendency of a distribution: there's the mode, there's the median, and there's the mean. There are many, many others, but those are the ones that are going to get you most of the way.

Let's talk about the mode first. I'm going to create a little data set here on a scale from 1 to 11, and I'm going to put in individual scores: there's a 1, and another 1, and another 1, and another 1. Then we have a 2 and a 2, then we have a score way over at 9 and another score over at 11. So we have eight scores, and this is the distribution; this is actually a histogram of the data set. The mode is the most commonly occurring score, or the most frequent score.
Well, if you look at how tall each of these bars goes, we've got more 1s than anything else, and so 1 is the mode, because it occurs four times and nothing else comes close to that.

The median is a little different. The median is looking for the score that's at the center if you split the data into two equal groups. We have eight scores, so we want two groups of four. One group of four is down here at the low end, and the other group of four is this really big one, because it ranges way out. The median is the place on the number line that splits those into two groups, and that's going to be right here at 1.5.

Now, the mean is a little more complicated, even though people understand means in general. It's the first measure we have here that actually has a formula: M, for the mean, is equal to the sum of X, that's our scores on the variable, divided by n, the number of scores. You can also write it out with Greek notation if you want, where the capital sigma is the summation sign: M = ΣX / n. With our little data set, that works out to 1 + 1 + 1 + 1 + 2 + 2 + 9 + 11; add those all up and divide by 8, because that's how many scores there are. That reduces to 28 divided by 8, which is equal to 3.5.

If you go back to our little chart here, 3.5 is right over here, and you'll notice there aren't any scores exactly there. That's because the mean tends to get distorted by outliers; it follows the extreme scores. But a really nice analogy, and I'd say it's more than just a visual analogy, is this: if this number line were a seesaw, then the mean is exactly where the balance point, or fulcrum, would be for the two sides to be equal. People understand that somebody who weighs more has to sit closer in to balance somebody who weighs less, who has to sit further out. And that's how the mean works.
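The three measures of center just described can be checked directly with Python's standard statistics module. This is a minimal sketch using the eight scores from the example:

```python
import statistics

# The eight scores from the histogram: four 1s, two 2s, a 9, and an 11.
scores = [1, 1, 1, 1, 2, 2, 9, 11]

mode = statistics.mode(scores)      # the most frequent score
median = statistics.median(scores)  # splits the data into two groups of four
mean = statistics.mean(scores)      # sum of X divided by n: 28 / 8

print(mode, median, mean)  # → 1 1.5 3.5
```

Notice how the mean (3.5) lands in the gap where no scores sit, pulled toward the outliers at 9 and 11, while the mode (1) and median (1.5) stay with the bulk of the data.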
Now let me give a little bit of the pros and cons of each of these. The mode is really easy to do: you just count how common each score is. On the other hand, it may not be close to what appears to be the center of the data. The median splits the data into two same-sized groups, with the same number of scores in each, and that's pretty easy to deal with; unfortunately, it's hard to use that information in many statistics after that. And finally, the mean: of these three, it's the least intuitive, and it's the most affected by outliers and skewness. That may seem to count against it, but it is, however, the most useful statistically, and so it's the one that gets used most often.

Next, there's the issue of spread; spread your tail feathers. We have a few measures here that are very common also: there's the range, there are percentiles and the interquartile range, and there's the variance and the standard deviation. I'll talk about each of those.

First, the range. The range is simply the maximum score minus the minimum score, and in our case, that's just 11 minus 1, which is equal to 10. So we have a range of 10, and I can show you that here on our chart: it's just that line there at the bottom, from the 11 down to the 1. That's a range of 10.

The interquartile range, which is usually referred to simply as the IQR, is the distance between Q3, the third quartile score, and Q1, the first quartile score. If you're not familiar with quartiles, that's the same as the 75th percentile score and the 25th percentile score. Really, what it means is that you're going to throw away some of the data. So let's go to our distribution here. The first thing we're going to do is throw away the two highest scores; there they are, they're grayed out now. Then we're going to throw away two of the lowest scores; they're grayed out too. And then we're going to get the range for the remaining ones.
Now, this is complicated by the fact that we've got this big gap between 2 and 9, and different methods of calculating quartiles do something different with that gap. If you use a spreadsheet, it's actually going to do an interpolation process, and it will give you a value of 3.75, I believe, for the third quartile, and then down to 1 for the first quartile. So it's not so intuitive with this graph, but that is how it usually works. If you want to write it out, you can do it like this: the interquartile range is equal to Q3 minus Q1, and in our particular case, that's 3.75 minus 1, which of course is equal to 2.75. And there you have it.

Now, our final measure of spread, or variability, or dispersion, is really two related measures: the variance and the standard deviation. These are a little harder to explain and a little harder to show. But the variance, which at least has the easier formula, is this: the variance is equal to the sum, that capital sigma again, of (X − M)², all divided by n. X minus M is how far each individual score is from the mean. You take that deviation, square it, add up all the squared deviations, and then divide by the number of scores. So the variance is the average squared deviation from the mean.

I'll try to show you that graphically. Here's our data set, and there's our mean, right there at 3.5. Let's go to one of these 2s: we've got a deviation there of 1.5, and if we make a square, it's 1.5 points on each side. There it is. We can do a similar square for the other 2. If we're going down to a 1, then it's going to be 2.5 squared, and the square can be that much bigger. We can draw one of these squares for each of our eight points. The squares for the scores at 9 and 11 are going to be huge and go off the page, so I'm not going to show them. But once you have all those squares, you add up the area and you get the variance. So this is the formula for the variance.
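The spread measures so far, the range, the IQR, and the variance, can be sketched with the standard library as well. One assumption here: `statistics.quantiles` with `method='inclusive'` (Python 3.8+) uses the same linear interpolation described above for spreadsheets, which is what reproduces the 3.75 value for Q3:

```python
import statistics

scores = [1, 1, 1, 1, 2, 2, 9, 11]

# Range: maximum score minus minimum score.
data_range = max(scores) - min(scores)  # 11 - 1 = 10

# Quartiles with linear interpolation, the way a spreadsheet does it.
q1, q2, q3 = statistics.quantiles(scores, n=4, method='inclusive')
iqr = q3 - q1  # 3.75 - 1 = 2.75

# Population variance: the average squared deviation from the mean.
m = statistics.mean(scores)  # 3.5
variance = sum((x - m) ** 2 for x in scores) / len(scores)  # 116 / 8 = 14.5

print(data_range, iqr, variance)  # → 10 2.75 14.5
```

The explicit sum-of-squares line mirrors the formula in the text; `statistics.pvariance(scores)` gives the same answer in one call.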
Now let me show you the standard deviation, which is also a very common measure, and it's closely related to this. Specifically, it's just the square root of the variance. Now, there's a catch here: the formulas for the variance and the standard deviation are slightly different for populations and for samples, in that they use different denominators. But they give similar answers, not identical, but similar, and if the sample is reasonably large, say over 30 or 50, then it's really just a negligible difference.

So let's do a little pro and con of these three things. First, the range is very easy to do; it only uses two numbers, the high and the low. But it's determined entirely by those two numbers, and if they're outliers, you've got a really bad situation. The interquartile range, or IQR, is really good for skewed data, and that's because it ignores the extremes on either end, so that's nice. And the variance and the standard deviation, while they are the least intuitive and the most affected by outliers, are also generally the most useful, because they feed into so many other procedures that are used in data science.

Finally, let's talk a little bit about the shape of a distribution. You can have symmetrical or skewed distributions; unimodal, uniform, or U-shaped ones; you can have outliers. There are a lot of variations; let me show you a few of them. First off is a symmetrical distribution, pretty easy: it's the same on the left and on the right, and this little pyramid shape is an example of a symmetrical distribution. There are also skewed distributions, where most of the scores are at one end and then they taper off. This right here is a positively skewed distribution, where most of the scores are at the low end and the outliers are at the high end. It's also unimodal; it's our same pyramid shape. Unimodal means it has one mode, or really, kind of one hump in the data. That's contrasted, for instance, with bimodal, where you have two modes.
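The population-versus-sample catch for the variance and standard deviation comes down to the denominator: n for a population, n − 1 for a sample. The statistics module has separate functions for each, so a quick sketch makes the difference concrete:

```python
import statistics

scores = [1, 1, 1, 1, 2, 2, 9, 11]

# Population formulas divide by n; sample formulas divide by n - 1.
pop_var = statistics.pvariance(scores)   # 116 / 8 = 14.5
samp_var = statistics.variance(scores)   # 116 / 7 ≈ 16.57

# The standard deviation is just the square root of the variance.
pop_sd = statistics.pstdev(scores)       # ≈ 3.81
samp_sd = statistics.stdev(scores)       # ≈ 4.07

# With only eight scores the two versions differ noticeably; past
# 30 or 50 scores the difference becomes negligible.
print(pop_sd, samp_sd)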
And that usually happens when you have two distributions got mixed together. There's also uniform distributions where every response is equally common. There's u shaped distributions where people tend to pile up at one end or the other in a big dip in the middle. And so there's a lot of different variations. And you want to get those the shape of the distribution to help you understand and put the numerical summaries like the mean and like that. And you want to get the standard deviation and put those into context. In some we can say this when you use descriptive statistics that allows you to be concise with your data. Tell the story and tell it succinctly. You want to focus on things like the center of the data, the spread of the data, the shape of the data. And above all, watch out for anomalies because they can exercise really undue influence on your interpretation. But this will help you better understand your data and prepare you for the steps that follow.