Here we continue our lecture on descriptive statistics. Welcome. This is just to remind you of what you already saw in the first part of the descriptive statistics lecture. We looked at measures of location; we did that already. Now we continue with measures of dispersion and measures of shape, and we'll look at different ways to combine measures together. Some of that is actually going to be done in standardizing data, which will be in another lecture, as will working with data from two variables. Here it's all a single variable, so we still have a lot to do. Why do we need to look at measures of dispersion? Maybe the mean is enough. Well, this example will show you why you want measures of dispersion. You're a company and you want to buy computer chips, and they've got to have an average life of at least 10 years. You have a choice of two suppliers; we'll call them supplier A and supplier B. If you're smart, you take a random sample of chips. In this case, they took a sample of 10. Look at the data. You can see the mean for supplier B's chips is 94.6 years versus 10.8 years for supplier A's chips. Which one would you use? If you only look at the mean, you say, wow, B has an average life of 94.6 years, so that's the one you choose. But if you look very carefully at the data, you'll see A is very consistent: every one of the 10 chips in your sample lasted 10 or more years. But look at B's chips. Some lasted 170 years, or 160, or 150 years; you don't need that. But several of them didn't even get to year three. In fact, four out of 10 didn't make it to year three. So we need a way to measure this kind of dispersion, or variability if you like. And certainly, if you're smart, you don't take chips made by supplier B. We're going to look at a whole bunch of measures of dispersion.
But now at least you understand why you want measures of dispersion. If you didn't look at them, you would make the mistake of buying chips from supplier B because of the higher mean, and even a higher median too; in fact, those are measures of location. But we're going to look at other measures that warn us not to buy chips from supplier B, with so much variability. What do we mean by dispersion? It's the spread of the data. You have a bunch of numbers. If they're all very, very close to each other, or very close to the mean of the data set, obviously the data has a very small spread. What if they're very far apart? They're all scattered, not near each other, and far from the mean of the data set. That data has a high dispersion. So we're going to look at five measures of dispersion: the range, the interquartile range, and then three that are really almost the same thing, the standard deviation, the variance, and the coefficient of variation. We often do them together. Those are the five measures of dispersion we're going to examine. The first is the range. It's the obvious one; we're all born understanding it. You don't need to take this course, you certainly don't need to be a statistician, and you can even explain this to your boss. The range is just the range of the data: the largest value minus the smallest value. You see a bunch of data over there? The range is the largest value minus the smallest value: 30 minus 1 is 29. Isn't that easy? You don't have to know the center of the data or how many data items you have, just the largest and the smallest. Easy to explain; anyone can understand it. That's a huge advantage. What's the disadvantage? If you have any extreme value at the low end or at the high end, it's going to blow up the range, and the range won't really be a very good measure of dispersion.
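The range computation is trivial to sketch in code. This is a minimal Python illustration, not from the lecture; the values in between are made up, and only the largest (30) and smallest (1) come from the lecture's example.

```python
# Range: largest value minus smallest value.
# Only max 30 and min 1 match the lecture's example;
# the values in between are hypothetical.
data = [1, 3, 7, 12, 18, 24, 30]

data_range = max(data) - min(data)
print(data_range)  # 29
```

Notice the range ignores everything except the two endpoints, which is exactly why a single extreme value can blow it up.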
Imagine if all the values are close to each other, or close to the mean, and then all of a sudden there's one out at the high end. The range won't be a very good summary measure of your data set as a whole. The interquartile range is just what it sounds like. If you remember your quartiles, it's the range between Q3 and Q1. Instead of taking the largest value minus the smallest value, we take the value of the third quartile minus the first quartile. What happened over here? Look at the data. The sample size is 15. The interquartile range is Q3 minus Q1, or 22 minus 3, which is 19, as opposed to the range, which would have been 98. Look at that huge value at the end: 98 is the largest value, 0 is the smallest value. The range is easy to understand, but in this case it doesn't really give you very much information. The interquartile range is a smaller number, and one that's easier to interpret. The way to understand the interquartile range is that it is the range of the central 50% of the data. If you separate out the central 50% of the data, which you can do easily with quartiles, the range of that is the interquartile range. Any extreme values on the high side or the low side are basically thrown out. Well, not literally thrown out, but they don't affect this measure of dispersion. Of course, that's also its greatest flaw. Data is very expensive: expensive to purchase, expensive to collect. What we're doing in order to compute this measure is ignoring 50% of the observations and only looking at the range of the central 50%. Let's see if we can do better. Let's look at the standard deviation, one of the most famous measures of dispersion. What is the standard deviation? Deviation from what? It's deviations around the mean. It's kind of an average: you're averaging out the deviations. There's a little problem, though, if you just look at the deviations around the mean.
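The interquartile range computation above can be sketched in Python. The lecture only quotes the summary numbers (n = 15, Q1 = 3, Q3 = 22, range 98), so the individual values below are hypothetical, constructed to reproduce those numbers, using the quartiles-as-medians-of-halves approach the lecture uses.

```python
from statistics import median

def quartiles(data):
    # Q1 and Q3 as the medians of the lower and upper halves
    # (the middle value is excluded when n is odd).
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    upper = xs[half + 1:] if n % 2 else xs[half:]
    return median(lower), median(upper)

# Hypothetical 15-value data set matching the lecture's summary numbers.
data = [0, 1, 2, 3, 4, 5, 6, 8, 10, 12, 20, 22, 24, 30, 98]

q1, q3 = quartiles(data)
print(q3 - q1)                 # 19, the interquartile range
print(max(data) - min(data))   # 98, the plain range, inflated by the outlier
```

The outlier 98 dominates the range but leaves the interquartile range untouched.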
The deviation around the mean mathematically would be the sum of (xi minus x bar). If you do that, you'll get zero every time: the sum of the deviations about the mean is always zero. The reason is that some of your values are above the mean and some are below it, so the plus and minus deviations always work out mathematically to give you zero. That's why we can't just take an average deviation. Here we see the formula for the standard deviation. It's called the definitional formula. We take the deviations, but we don't want the pluses and minuses to balance each other out, so we square them. We take each xi minus x bar (x bar is the mean of the data set), square it, and sum. That sum is called the sum of squares. Then we divide by n minus 1. You'll ask: if we're taking an average, why don't we divide by n? Mathematically, we'll get to that. There's your formula, the definitional formula. It has lots of interesting mathematical properties. One I mentioned: the squared deviations don't sum to zero. This sum is also a minimum: no other value subtracted from the x's will give you a smaller sum of squared deviations than the mean does. It's called the least squares property. And finally, as I mentioned already, we divide by n minus 1, not n. It's called the loss of a degree of freedom; you'll see it's a mathematical adjustment. Now we're looking at two data sets; we'll call them x and y. Notice they both sum to 15, so the average is 15 over 5: both have the same average of 3. Yet look at the variability. The first data set, the x's, are relatively close to 3. But in the second data set, you've got zeros and tens; obviously there's a lot more variability. We'll see what that does. Look at the x's. The first column is the data: 1, 2, 3, 4, 5. The mean was 3. The third column is the deviations: minus 2, minus 1, 0, plus 1, plus 2.
Notice it adds up to 0. Now we have to square them, so we square each of those deviations; again, the total is called the sum of squares. We take each deviation, square it, sum, and get 10. Now take 10 and, remember, divide by n minus 1, which is 4. Take the square root of that, the square root of 2.5, and you get 1.58. So the standard deviation for x is 1.58. Now look at the y's. You see the deviations are much bigger. You've got 0, 0, 0, 5, 10 in the first column. The mean is still 3. But look at the deviations now: minus 3, minus 3, minus 3, plus 2, and plus 7. The sum of squared deviations, the sum of squares, is 80. Take 80 divided by n minus 1, which is 4, and that gives 20; the square root of 20 is 4.47. So the standard deviation for the y data set is 4.47, while the standard deviation for the x's is 1.58. This is what we saw intuitively: the variability of the y data set is much more than the x data set. Now we're going to explain why we divide by n minus 1, generally. The reality is, if you took a census, you'd have no problem. Then you'd be measuring the population standard deviation. Notice the formula is also the sum of the squared deviations, but now using mu, because it's a population. You have a census, so you divide by capital N. There's no bias here; you don't have to worry about any kind of bias. Again, you have to be careful when using a program like Excel. For this standard deviation you divide by N, and Excel calls it STDEV.P, the P showing you that it's a parameter, a population. But normally, in this course, we're going to assume you've taken a sample. That's generally how it works in the real world: you're taking samples. Now, when you're taking samples, s is supposed to be estimating sigma, the population standard deviation. Mathematicians have shown us that when you do that, you must divide by n minus 1.
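The x and y worked example above can be reproduced directly with Python's standard library, which uses the same n − 1 divisor for a sample:

```python
from statistics import mean, stdev

x = [1, 2, 3, 4, 5]
y = [0, 0, 0, 5, 10]

# Both data sets sum to 15, so both have a mean of 3,
# but the y values sit much further from that mean.
s_x = stdev(x)  # sqrt(10 / 4) = sqrt(2.5)
s_y = stdev(y)  # sqrt(80 / 4) = sqrt(20)
print(mean(x), mean(y))              # 3 3
print(round(s_x, 2), round(s_y, 2))  # 1.58 4.47
```

Same mean, very different standard deviations, exactly as the table shows.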
Otherwise you're going to introduce a bias: s will not be an unbiased estimator of sigma. If you want an unbiased estimator, you have to divide by n minus 1. Those of you with more advanced mathematical training can look this up in a math book and find out why. That's why we divide by n minus 1. All you have to remember is that when you're getting the standard deviation, if you're working with a census, which is rare, you divide by N. If you're working with a sample, you take the deviations, and notice you're using x bar, not mu; that's actually what causes the bias. You take the sum of the squared deviations around x bar, divide by n minus 1, and now s is an unbiased estimator of sigma. You'll hear the term losing a degree of freedom. Essentially, if your boss asks, why are we dividing by n minus 1 when we're supposed to be averaging, you tell your boss: you'd think we'd divide by n, but mathematicians show us that introduces a bias, so to get rid of it we must divide by n minus 1. You'll see this in Excel, too: there's one standard deviation function for a parameter or population, and another for a sample or statistic. The next measure of dispersion, the variance, is really just the same as the standard deviation: it's the standard deviation squared. Or, conversely, when you computed the standard deviation, you ended the process with a square root; everything that was under that square root is the variance. The standard deviation is easier to explain because it matches the intuition that we want something that averages out the deviations, and it is a form of averaging. The variance is harder to explain. In addition, the standard deviation is in the same units as the original data, while the variance is in those units squared, which is harder to explain.
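The census-versus-sample distinction shows up directly in standard library functions (and in Excel as STDEV.P versus STDEV.S). A quick sketch of the two divisors:

```python
from statistics import pstdev, stdev

x = [1, 2, 3, 4, 5]

# Census / population: divide the sum of squares by N.
print(round(pstdev(x), 4))  # sqrt(10 / 5) = 1.4142
# Sample: divide by n - 1 to get an unbiased estimator of sigma.
print(round(stdev(x), 4))   # sqrt(10 / 4) = 1.5811
```

Same data, same sum of squares; only the divisor changes, and the sample version comes out slightly larger.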
But it's still a valid measure of dispersion, and it's used all the time. Where are you going to use the definitional formula? Same as before, same as with the standard deviation. Why do we like the definitional formula? It helps you understand what you're doing. You can see the deviations there, and the sum of the squared deviations, so you know you're looking at deviations, and it helps you understand that you're looking at the spread of the data about the mean. The computational formula is the one you need to use if you're doing a computation with a lot of data. If you're using a calculator, you'll use its memory; most likely you'll be using something like Excel or some kind of statistical package. The people who write those statistical packages use the computational formula. As you can tell from the name, it's easier for computation. In addition, there's less rounding. In the definitional formula, as you'll see when you do your own problems, x bar is a mean, so you already have rounding there. Every time you subtract it from a data value and then square the result, you're compounding the rounding problem, so there's a lot of rounding in the end. We don't care right now, because all we want is for you to understand it. In the real world, when you do this for real as a user of statistics, as a statistician, as a business person, or as anyone who has to deal with numbers, you'll be using the computational formula, and you won't even have to know it; it'll just be what's used in Excel or some other software package. But what I want you to do, and what the other Professor Friedman wants you to do, is to understand the formula you're using in order to measure dispersion. Finally, the last measure of dispersion we're looking at is the coefficient of variation.
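As a sketch of the computational formula just described, one common form is s² = (Σx² − (Σx)²/n) / (n − 1). Here is a hypothetical Python version; note it never subtracts x bar from each data value:

```python
from math import sqrt

def sample_std_computational(data):
    # One pass over the data: accumulate the sum and the sum of squares,
    # so x-bar is never subtracted item by item (less compounded rounding).
    n = len(data)
    sum_x = sum(data)
    sum_x2 = sum(v * v for v in data)
    variance = (sum_x2 - sum_x ** 2 / n) / (n - 1)
    return sqrt(variance)

print(round(sample_std_computational([1, 2, 3, 4, 5]), 2))  # 1.58
```

It agrees with the definitional formula; only the arithmetic is rearranged.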
The coefficient of variation is a little different from the others in that it doesn't only measure dispersion in the data; it produces a pure number, a percentage, and that means you can compare data sets because you're not comparing apples and oranges, in the classic metaphor. Look at the formula. What you're doing is answering the question: what percent of the mean is the standard deviation? So you're looking at the standard deviation in relation to the mean. You can do that because they're both in the same units, and when you cancel the units you get a pure number. Multiply by 100% and you get a percentage. So, for example, if you find a coefficient of variation of 100%, it means the sample standard deviation is equal to the sample mean; wow, that's a lot of variability. And if you find a coefficient of variation of 200%, that's certainly even worse. So if we want to look at more than one set of data, we probably want to include the coefficient of variation in our metrics, in our summary statistics, in our descriptive statistics. Why? Because the coefficient of variation cancels the effect of the units. For example, and we have a couple of examples here, suppose you're looking at two stocks and you want to compare them to see which is more variable, more volatile, and one is in dollars and one is in yen or some other currency. How do you know, just looking at the numbers? Well, the mean is in the units, say dollars, and the standard deviation is in the same units; if we divide one by the other, we get around the effect of the units and look at just a pure percentage. The same thing is true, by the way, even if both data sets are in the same units but are orders of magnitude apart. In this case, the example asks: what if you have a stock that sells for around 300 as opposed to one that sells for about 25 cents? Go even further.
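The formula just described is a one-liner in code. A small sketch, with a made-up data set for illustration:

```python
from statistics import mean, stdev

def coefficient_of_variation(data):
    # CV = (s / x-bar) * 100%: the units cancel, leaving a pure percentage.
    return stdev(data) / mean(data) * 100

# Hypothetical data: s is about 1.58 and the mean is 3,
# so the CV comes out around 52.7%.
print(round(coefficient_of_variation([1, 2, 3, 4, 5]), 1))
```

Because the result is a percentage, the same function works whether the data is in dollars, yen, or hours.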
How about if you're looking at income data, and one data set is in the millions and the other is around minimum wage? You can't really compare the volatility; there's no way to know until you actually get a metric that cancels the effect of the units. So the coefficient of variation is not only a better measure of dispersion in these cases, it's a very necessary measure to have at our disposal. For those of you interested in finance, we have an example here with two stocks. Look at stock A and stock B. Suppose a customer, a client, asks which one is more risky or volatile; volatility is the term used in the stock market. If you just look at the standard deviation or variance: the standard deviation for stock B is $11.33, while the standard deviation for stock A is $1.62. You might make the mistake of saying, well, stock B has a much higher standard deviation, so it's more volatile. But you'd be wrong. Look at stock B: the numbers are really very close to each other, but your mean is very high, the average is $188.80, so when you look at deviations you get big numbers, relatively speaking; $210 minus $188.80 means you're looking at about $21. Now look at stock A. The average price of the stock is $1.70, so your biggest deviation, $5 minus $1.70, is just $3.30. So again, looking simply at the standard deviations, or even the variances, you'd make the mistake of thinking stock B is more volatile.
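The stock comparison works straight from the summary statistics quoted above (means and standard deviations in each stock's own prices):

```python
# Summary statistics from the lecture's two-stock example.
mean_a, std_a = 1.70, 1.62      # stock A: cheap, jumpy
mean_b, std_b = 188.80, 11.33   # stock B: expensive, steady

# Dividing s by the mean cancels the dollar units.
cv_a = std_a / mean_a * 100
cv_b = std_b / mean_b * 100
print(round(cv_a, 1))  # 95.3 -> stock A is the truly volatile one
print(round(cv_b, 1))  # 6.0
```

The raw standard deviations point the wrong way; the coefficients of variation point the right way.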
Notice what happens when you look at the coefficient of variation, which is not in dollars; the variance would be in dollars squared, the standard deviation is in dollars, but the coefficient of variation is a percentage. The coefficient of variation for stock A turns out to be 95.3%, showing you that's a lot of volatility; stock A is incredibly volatile. You can see, looking at the numbers, this stock at one time was 20 cents in August, and then it jumped up to $5 in July, and then you see it's been jumping around all over the place. You can lose a lot of money investing in a stock like that. For stock B, look at the coefficient of variation: it's 6%, because it doesn't vary that much. It was $210 in August and $175 in February, but that's not as huge a jump as you see with stock A. So the conclusion here is: do not use the standard deviation to compare two stocks. Use the coefficient of variation, which has no units; it's a percentage, and it essentially tells you what percentage the standard deviation is of the mean. Let's look at a problem where we're going to get all the basic descriptive statistics. The data has been ordered for you, starting with 0; there are 10 observations: 0, 0, 40, 50, all the way to 100. Now for the descriptive statistics. The mean: you add them all up, the sum of the x's is 560, divided by 10, so the mean is 56. The median: remember what you've got to do when n is even, you look at the two middle values; the median, Q2, is 55. The mode: which value came up most frequently? It seems you have three modes: 0, 50, and 100. The approximate Q1 and Q3: remember, below the median you have five numbers, so Q1 is 40, taking the median of 0, 0, 40, 50, 50. Do the same for the numbers above the median, 60, 70, 90, 100, 100; the median of those five numbers is 90. So Q1 is 40 and Q3 is 90. Now our measures of dispersion. The range: highest minus lowest, 100 minus 0, so the range is 100. The interquartile range: Q3 minus Q1, and
you get 90 minus 40, which is 50. The variance: take each number minus the mean and square it. So you do (0 minus 56) squared, (0 minus 56) squared, (40 minus 56) squared, and so on until you get to (100 minus 56) squared. That gives the sum of squared deviations, the sum of squares, which is 11,840. Remember, you divide by n minus 1; 10 minus 1 is 9, and there is your variance, 1,315.56. The standard deviation is just the square root of that, 36.27. And the coefficient of variation is a percentage: 36.27, the standard deviation, divided by the mean of 56, times 100%. You have a very variable data set; a coefficient of variation of 64.8% is quite variable. We have looked at measures of location in our data, in other words, where the data sits on the scale of real numbers, and especially measures of central location, the center of gravity of the data. Then we looked at measures of dispersion: how far apart are the data from each other, how dispersed is the data around the mean? And now we are looking at the shape of the data. Is the data symmetric about the mean, or about the median, or is it skewed by extreme values on either the right or the left? If our data is skewed, it can be positively skewed or negatively skewed. If it's positively skewed, there are extreme values on the high side, and so the mean is greater than the median. If it's negatively skewed, it's the other way around: there are extreme values on the low side, and so the mean is less than the median. Of course, this is nothing new to us, because we already know the mean is affected by extreme values while the median is always going to be the center point of the data. Here is the same explanation, only with pictures, a graphical explanation. You can see that if a distribution is left skewed, it's pulled to the left by an extreme value, or more than one; if it's right skewed, it's pulled to the right. Imagine
an elastic band being pulled by some large data value. If it's symmetric, you don't see either of those; the data set is symmetric around the measure of central location. Let's examine a data set. We're looking at 12 employees and determining how long it took each of the 12 to complete a certain task. Note the fastest employee did it in 2 hours, the next in 8 hours, and the maximum value was 63 hours; somebody took a long, long time. In any case, you look at the mean of this data set, and the mean is 180 over 12, which is 15 hours. The median is 10 hours. When you see a huge discrepancy between the mean and the median, that tells you something; it usually indicates what we call skewness. Notice the mean is much higher than the median: 15 versus 10, a 5-hour difference. Now, what caused that? Why are the mean and the median so far apart? Obviously, if you look at the data, you'll see there's one value, you might call it an outlier; that 63 seems to be very different from the rest, and it throws everything off. What happens when you have one very high value? It makes the mean higher, but the median, which is really a measure of location, stays where it is. So the median is 10, but that 63 pushes the mean much higher than the median. The same thing would happen in the other direction if we had a very, very low value. This is called a positive skew: the mean is so much more than the median. Now try to imagine in your mind that the 63 were 163. Let's say it's the boss's son, and he's a total klutz; it took him 163 hours to do something that should take about 10 hours or so. Well, now the mean is even higher than 15; it jumps. You can't fire the boss's son, but you're certainly going to have very skewed data. This is called a positive skew because you have a very high number, an outlier, that pulls the mean up and makes it much higher than the median. And just to complete the data set, let's
look at the variance. It turns out to be 2,868 over 11, which is 260.73 square hours. Don't ask what a square hour is; that's what happens when you use variances, you get squared units. The standard deviation, which is the square root, is 16.15 hours, and the coefficient of variation is 107.7%. As a statistician, you should look at this data and say: that's interesting, the standard deviation is higher than the mean. That shows incredible variability in a data set. If you're in quality control, you want that coefficient of variation to be a lot lower than 107.7%, but that's something for another course. Anyway, now you've seen how data gets skewed. One way of presenting data is with a five-number summary, which gives you some information about the shape. Basically you need the smallest value, which in this case was 2; the largest value, which is 63; and Q1 and Q3. Q1 is 8, and Q3 is the average of 15 and 18, which is 16.5. And we have the median too, which is 10. One thing we can see just looking at this data set: the distance from Q3 to the largest value is a lot more than from the smallest value to Q1, and that indicates it's skewed to the right. Look at the bottom: the distance from Q3 to the largest value is 16.5 to 63, and that's a lot more than the distance from the smallest value, 2, to Q1, which is 8. That's an indication of a right skew. This is another way to present the data: it's called a box plot, or a box and whisker plot; sometimes they use the word whisker. You show the data from the smallest to the largest. You see the line starts at roughly 2 and goes all the way to 63. Just looking at this, you can see the box in the center, which gives you Q1, the 8; then you see the median there, which is 10; and then you see Q3, 16.5. Now notice that the box, which holds Q1, the median, and Q3, is all the way to the left; it's not in the middle. When data is
symmetric, it will be right in the middle; everything will be centered, and common sense tells you what it's supposed to look like. Here we see the whisker on the right, which starts at 16.5 and goes all the way to 63; that's a long, long whisker. The whisker on the left, which starts at 2 and goes to 8, is a tiny whisker. It's the same logic we had before with the five-number summary, by the way, but you can get this kind of printout: you ask for a box plot, or a box and whisker plot, and some computer programs will print it out for you, and right away you'll see if the data is skewed to the right or skewed to the left. If it's symmetric, if everything's in the center and it's centered, then you've got symmetric data. This is not symmetric. We're still in the middle of the descriptive statistics part of the course; there's going to be another descriptive statistics lecture following this one. But for now you have enough material to start doing problems. And remember: do your homework, do whatever problems you can, find problems everywhere, because the more you practice, the better you're going to be. You'll remember the material and you'll also do well on your exams. Practice, practice, and, as always, practice.
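The five-number summary behind the box plot can be sketched as follows. The lecture doesn't list all 12 task times, so the data below is hypothetical, constructed only to reproduce the summary values it quotes (smallest 2, Q1 8, median 10, Q3 16.5, largest 63):

```python
from statistics import median

def five_number_summary(data):
    # Smallest, Q1, median, Q3, largest, with quartiles taken as
    # medians of the lower and upper halves, as in the lecture.
    xs = sorted(data)
    half = len(xs) // 2
    lower = xs[:half]
    upper = xs[half + 1:] if len(xs) % 2 else xs[half:]
    return min(xs), median(lower), median(xs), median(upper), max(xs)

# Hypothetical task times in hours; only the five summary values
# below come from the lecture, the individual values are made up.
hours = [2, 5, 8, 8, 8, 9, 11, 13, 15, 18, 20, 63]
print(five_number_summary(hours))  # (2, 8.0, 10.0, 16.5, 63)
```

The long gap from Q3 (16.5) out to 63, versus the short gap from 2 up to Q1 (8), is exactly the right skew the box plot makes visible.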