Welcome to our lecture on descriptive statistics. This lecture is long, so you'll see that it's in several parts. In this lecture, we're going to learn different metrics and graphical techniques for summarizing the data that's in front of us, as opposed to inferential statistics, which, as you know from the previous lecture, means we want to take the data that we have in front of us and draw conclusions about the larger population that the sample data came from. Over here, we're only interested in the data in front of us. We might want to measure, let's say, the performance of a class, or the performance of a unit in a company. We're not making any generalizations to the larger world.

We have a lot of data. Data sometimes comes in huge data sets. We could have thousands of pieces of data, maybe more. So how do we take this data and get at the information that's in it? You don't want to just hand someone a huge database and say, "Oh, here, take a look at this. Those are my descriptive statistics." No, you really want to summarize this data in some meaningful way, so that it's not just raw data; it's information.

We also know from the previous lecture that we've got different levels of data: various levels of numerical data, and categorical data, and we work differently with different levels of data. We're going to look at some quantitative metrics and we'll look at some, but not all, graphical techniques. We will learn various techniques for summarizing and describing our data, whether the data is numerical or categorical. With categorical data, what do we do? We know we can only have frequencies, maybe percentages, and maybe we can draw pretty pictures. We're going to look more at graphical techniques in a later part of the descriptive statistics lecture, one that's totally dedicated to grouped data, where we use frequencies.
Along the way, we will see some graphical methods, but we're not going to do a lot of the simple stuff that you're basically born knowing. So I'm not going to ask you to draw a pretty pie chart to show how the data is broken up. That's not what we consider our job here.

In front of you is pretty much an outline of most of the descriptive statistics lecture, no matter how many parts it's broken up into. Most of it is going to focus on data that comes in a single variable. We're going to study various measures of location. These are metrics that are used to locate our data on the scale of real numbers. Then we're going to look at measures that have to do with how far apart the values are from each other. We're going to look at a bunch of other stuff: measures of shape, meaning how the data is distributed. Maybe it has a peak, maybe it's slanted. We're going to look at some ways of summarizing all these measures at once, which is very, very useful. We're going to look at standardizing data, which you know a little bit about already, I think, and we're going to work with it in a certain way that's going to stand us in good stead later on in the semester. And I hope you see that very, very soon.

OK, we're going to look at measures of location. You'll see that there are two kinds, but right now we're going to look at measures of central tendency, in other words, the central location. This is a kind of measure that summarizes the whole data set. Sometimes you just want one number that represents the entire data set. These include the mean. The mean is an average, and an average is a way of summarizing. "How did you do this semester?" "Well, I took 10 quizzes and my average was..." The average is a measure of central tendency. It's kind of a summary. You might want to use the median; that's another one. And sometimes we use the mode. Those are the three we discuss in this course.
There are a few others, but only these three are discussed.

OK, let's look at the sample mean. The sample mean is the sum of the x_i over n. Notice that symbol, the sigma: sigma x_i. And again, if you forgot what sigma means, just go back to the boot camp, where we explain how to use it. There's an explanation right here too. The sum of the x_i is x_1 plus x_2 plus x_3, etc. You're just basically summing a column of numbers or a row of numbers. So we take the sum of the x_i over n. That gives us the sample mean.

Let's look at two examples. In the first example, you've got 1, 2, 2, 4, 5, 10, and you're asked to calculate the sample mean. n is 6; there are six observations. This is a row of numbers, and we sum up the row. We get 24 as the sum. 24 over 6 is 4.0. That's the sample mean.

Example two. Here we have five numbers: 1, 1, 1, 1, 51. n is 5. We sum that row of numbers, 1 plus 1 plus 1 plus 1 plus 51, and we get 55. 55 over 5 is 11.0. That's the sample mean. Note that the sample mean is affected by extreme values. That 51 there changed the mean by a lot. One huge number can change the mean.

The median is basically the middle value of the data after you've ordered it. You have to order the data. I like to do it from lowest to highest. Now the data is in a specific order, lowest to highest. The median is going to be the middle value, which basically means that half the values in your data set are below it and half are above it. That's why the median can also be seen as the 50th percentile. We'll get to learn about percentiles shortly. But the median is the 50th percentile: half below, half above.

Now what do you do? After you order the data (again, order the data), if n is odd, the median is the middle value. So if n is an odd number like 7, 9, or 11, the median is the middle value. If n is even, you have two middle values, and you've got to average those two middle values.
Average them, and the average of those two becomes the median.

Here's an example of the median. Look at the data in front of you: 0, 2, 3, 5, 20, 99, 100. It's been ordered, lowest to highest. n is odd; n is 7. There are 7 numbers there. Since there are 7, the middle value, the 4th one, is the median. Now look at the 5 carefully. Three of the values in your data set, 0, 2, 3, are below it. And three, 20, 99, 100, are above it. That's how you know you've got it right. The median is 5.

And here's something to know about it. The mean and the median are unique for a given set of data. There's one mean for a data set, and there's one median. But what's interesting about the median: notice what happens if you change the 100 and make it 5,000. Since the median is really a position, the middle position, nothing has changed. So even if the 100 becomes 5,000, the median remains 5. But what happens to the mean if you change the 100 to 5,000? It'll increase dramatically. So the mean is affected by extreme values. The median is a position, so you can have an extreme value and it's not going to change the median.

Now let's look at example 2. We have n of 6. There are 6 values there: 10, 20, 30, 40, 50, 60. You have two middles, in effect, in the data set, so you've got to take the average of the 30 and the 40, the two middles. Just take an average: 30 plus 40, over 2. So the median is 35. Notice 35 is not one of your observations. But three of the values, 10, 20, 30, are below 35, and three, 40, 50, and 60, are above it. So again, that 35 is the median. Half of the data is below that value, and half is above it.

What do we know about this measure, about the median? For one thing, as we saw already, the median, unlike the mean, is not affected by extreme values. It's only affected by the number of observations because, as you saw, it has to do with the position, the middle value, the one that's exactly at the center of your data. Extreme values do nothing.
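If you'd like to check these numbers yourself, here is a minimal Python sketch. It is not part of the lecture slides; it assumes Python's standard `statistics` module, and the variable names are made up for illustration. It recomputes the two mean examples and the two median examples, and then repeats the experiment of changing the 100 to 5,000.

```python
from statistics import mean, median

# The two sample-mean examples from the lecture.
mean_example_1 = [1, 2, 2, 4, 5, 10]      # sum 24, n = 6
mean_example_2 = [1, 1, 1, 1, 51]         # sum 55, n = 5
print(mean(mean_example_1), mean(mean_example_2))

# The odd-n median example (already ordered, lowest to highest).
data = [0, 2, 3, 5, 20, 99, 100]
# The same data with the largest value changed from 100 to 5,000.
data_extreme = [0, 2, 3, 5, 20, 99, 5000]

# The median is a position, so it is 5 in both cases.
print(median(data), median(data_extreme))
# The mean uses every value, so the 5,000 pulls it up sharply.
print(mean(data), mean(data_extreme))

# The even-n example: the median averages the two middle values, 30 and 40.
even_data = [10, 20, 30, 40, 50, 60]
print(median(even_data))
```

Note that `median` sorts the data for you, so you could pass the values in any order and get the same answer.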
That resistance to extreme values is one reason we like to use the median for income data, let's say, and also, of interest to you and to me, exam scores. The second thing we know about the median, and when you think about it this should be obvious: if you pull a piece of data from your data set at random, that value is just as likely to be greater than the median as less than the median. Sometimes that can be a nice property. And finally, this one is really more for those of you who are interested in going further in, let's say, mathematical probability. So everyone else, you can tune out until the next slide. Mathematically, take the summation of the absolute values of the differences around the median, that is, the deviations of each data value from the median. You take just the absolute values because you don't want the pluses and minuses to cancel. If you add them all up, that sum is the smallest it could possibly be: measured around any other value, it can only be the same or larger.

The mode is the value of the data that occurs with the greatest frequency. Look at the example below. You've got 1, 1, 1, 2, 3, 4, 5. Now, the 1 shows up three times, so that's the mode. Everything else only appeared once. Look at the next example: 5, 5, 5, 6, 8, 10, 10, 10. Now you have two modes. It's not impossible to have several modes. In this case, you have two modes, the 5 and the 10, because they each showed up three times. And we call that a bimodal (two modes) data set.

Okay, let's look at some properties of the mode. Unlike the mean and the median, which always exist, the mode may not exist, and it may not be unique. Look at the first problem. What if I ask you, what's the mode? 1, 2, 3, 4, 5, 6, 7, 8, 9, 0. Well, it has no mode. Everything showed up once. Ten observations, and there's no mode. There is a mean and there's a median, but no mode. Look at the next problem. The mode may not be unique, as we saw before. Look at that.
There's a 0, and there are two 1s, two 2s, two 3s, two 4s, two 5s, two 6s. So notice how many modes you've got. You've got 1, 2, 3, 4, 5, 6. They each show up twice. You have six modes.

Quantiles are measures of non-central location. You don't always want the center. You may want something that's not at the center. The most commonly used quantiles are quartiles, which we'll discuss in this course; quintiles, used a lot by economists when they want to measure income inequality; deciles; and finally percentiles.

Let's try to understand quartiles. It's in the word: quarter. So obviously you're breaking something up into quarters, four parts. You want to split your data, and it's ordered data, into four parts. Try to imagine you have this big chocolate bar and four kids, and you want to cut it up so you have four equal pieces. How many cuts do you need to get four equal pieces? Well, the answer is always one less. You want four pieces, you need three cuts, right? One cut at the first quarter, one at the second quarter, one at the third quarter.

Okay, Q1 is the first quartile. What's true of the first quartile? 25% of the observations are smaller than Q1, and 75% are larger. That's why another way to see Q1, the first quartile, is as the 25th percentile. Q2 is the second quartile. 50% are below it and 50% are above it. In other words, it's the median. The second quartile is the median. It's also the 50th percentile. Finally, the third quartile: 75% of the observations are below Q3 and 25% of the observations are above it, and that's why you can also call it the 75th percentile.

Okay, we're going to look at a data set and see how we can get a quick and dirty approximation for Q1 and Q3. Look at this data set: 210, 220, all the way up to 270, 280. Okay, you've got 10 numbers there.
Okay, and if I ask you to get the median, which by the way is Q2, you know that if n is even, it's the average of the two middle values. So you take the average of 225 and 235. Your median is 230. You have five numbers below that value and five above it. Now take the five numbers below the median, only those, nothing else, and get the median of those. Pretend there are only five numbers. If you do that, you've got five numbers starting with 210 and ending with 225. So Q1 is the third of those observations, which is 225. Now take the five numbers above the median. That's 235, 240, 250, 270, 280. Get the median of those five numbers and that'll give you an approximation to Q3, and that becomes 250.

This is the quickest way to get Q1 and Q3. It's an approximation. You could do it with a formula; in fact, Excel does it with a formula, and it'll get you a slightly different answer. But for a test, where you need a quick answer, this is the way to go. Again, if you want to do it the correct way and you have a big data set, which is the way it really is in the world, you would be using Excel and it'll do it for you.

Here's a problem. A company has 12 salespeople selling computers. We have data for these 12 salespeople for the most recent week for which data was collected, and that's sales per week. We want you to compute the mean, the median, the mode, and the quartiles. Obviously, as you should know, the best way to do this is to pause the video, pause the PowerPoint narration, do the problem, and then come back.

Here we are again. How do we answer this question? Other than for the mean, that is, for the median and the quartiles, we have to put the data in order. The mean and the mode we could probably figure out without that, but the mode would be difficult. We might as well just order the data, and you see the ordered data right there. The sum of all the sales for the week is 76 over the 12 salespeople.
76 divided by 12 is 6.33. That's the mean: the average number of sales per computer rep for the week was 6.33. The median, remember, is at the 50% mark. In the ordered data, what's the 50% mark? We have 12 observations, so it should be between the 6th and the 7th. 0, 2, 3, 4, 5, 6: that's the first half. So the median is smack in between 6 and 7, or 6.5. The mode: well, there are two 10s and just a single instance of every other value. The mode is 10. For Q1, we take the lower half, the observations below the median, and get the median of those, and that'll be between 3 and 4. So Q1, the first quartile, is 3.5. We do the same thing for the larger numbers, the half of the data set that's above the median, and we're halfway in between the 9 and the 10, again halfway in between the 3rd and the 4th observations, so Q3 is 9.5.

Right, let's look at some other quantiles. We're focusing more on quartiles and percentiles in this course, but in other courses you may hear the term decile. What is a decile? That's where you have 9 cuts, essentially, and you cut the data up into 10 equal portions. So the first decile is the 10th percentile: 10% below it, 90% above it. The 7th decile: 70% of the data is below it and 30% is higher than it. Sometimes we talk about quintiles in particular. For quintiles, you need essentially 4 cuts, and you divide the data into 5 equal portions. So the first quintile: 20% below, 80% above. If you go to the 4th quintile: 80% below, 20% above. In economics we look at these quintiles; we look at the 4th quintile and we compare it to the 1st quintile as a measure of income inequality.

Percentiles are on the next slide. You need 99 cuts. Imagine there's a long chocolate bar and you make 99 cuts, and guess what? You have 100 equal pieces. So we need 99 percentiles to divide the data into 100 equal portions. Very often we use percentiles for standardized exams like the SAT, tests like that.
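The "one less cut than pieces" idea can be sketched in a few lines of Python. This is only an illustration, not part of the lecture: the data set is made up (the integers 1 through 200), and it assumes `statistics.quantiles` from Python's standard library, which returns the cut points for a requested number of equal portions. Be aware that this function uses an interpolation formula, so its quartile values will generally differ a little from the quick shortcut used in this lecture.

```python
from statistics import quantiles

# Hypothetical data, made up for illustration: the integers 1 through 200.
scores = list(range(1, 201))

# Number of equal portions for each kind of quantile named in the lecture.
portions = {"quartiles": 4, "quintiles": 5, "deciles": 10, "percentiles": 100}

# quantiles(data, n=k) returns the cut points that split the data
# into k equal portions, so we count how many cuts come back.
cut_counts = {name: len(quantiles(scores, n=k)) for name, k in portions.items()}

for name, k in portions.items():
    print(f"{name}: {k} portions need {cut_counts[name]} cuts")
```

In every case the number of cuts comes out one less than the number of portions: 3, 4, 9, and 99.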
So what does a score of 40 on a standardized test mean? You don't know what it means. It seems like a horrible grade. You've got a 40. But really, if it's at the 99th percentile, it's actually a great score. The 99th percentile means that you beat 99%, and 1% beat you, assuming you're right on the line. Q1, Q2, Q3, we already spoke about. Q1 is the 25th percentile: you beat 25%, and 75% beat you. If you're at Q2, which is the median, you beat half the people taking that test, and half beat you. At Q3, you beat 75%, and 25% beat you.

Folks, we're going to show you how to use Excel to get percentiles. Usually you need a big data set. You don't do percentiles unless you have several hundred observations, even millions. So we're going to use the computer. We'll learn how to do this in Excel.

Look at problem one. n is 16. There are 16 values. We ordered them for you already. The lowest value is 1. The maximum value is 10. And now we want to get the mean. Again, you don't need to order data to get the mean, but it helps. So we add the numbers all up. We're adding up 16 numbers. The sum of the x_i from 1 to 16 is 65; divide by n of 16, and the mean is 4.06. The median is the middle value. You've got to average the two middles, since n is an even number; n is 16. So it's the average of 3 and 4. Notice the little red line there. The median is 3.5. The mode, the value that showed up the most, with the highest frequency, was 2. So the mode is 2. For Q1, we're using an approximation, a shortcut. Look at the eight values below the median and take the median of those. You see the little red lines? 2. Q1 is approximately 2. Do the same for the values above the median. There are eight of them, starting with 4 and ending with 10. You're averaging 5 and 6, which is 5.5. That's the approximate Q3, 5.5.

Problem two: looking at absences, we took a sample of 13 people in the company, and the absences range from 0 to 12. Okay, so first we order the data. For the mean, we added them all up: 39 over 13 is 3.0 absences. That's the sample mean.
The median, again: since n is 13, and n is odd, the middle value is 2. It's in red, to show you that's the middle value. The median is 2 absences. The mode, the one that showed up most, was 0: 0 absences. For Q1, take all the numbers below that red 2, the six numbers, and get the median of those. That's 0.5, the average of 0 and 1. Then take the six numbers above. It's the average of 4 and 5. You see the broken red line: 4.5. Those are our approximate Q1 and Q3.

Let's look at problem 3. These are reading levels. If you get a 5, that means you're reading at the level of a 5th grader. A 9 means you're reading at the level of a 9th grader. These are 8th graders. We'll look at 16 of them. First, we order the data. It starts with 5, and the highest is 10. To get the mean, you want the sum: 120 over 16 is 7.5. The average reading level of the 16 students we selected randomly is 7.5, right in the middle between 7th and 8th grade. The median: again, n is even, 16. It's the average of the two middle values, 7 and 8. 7.5 is the median, which, by the way, is also Q2. Now the approximation for Q1 and Q3. We take the numbers below the median. There are 8 of them. And we take the average of the two middles, 6 and 6. So Q1 is approximately 6. Then take the 8 numbers above the median, starting with 8 and going to 10. Again, you see the broken red line. So the Q3 approximation is 9. The mode is the value that showed up most often, and that was a 9.

We're going to revisit this problem and show you another way to do this using frequency data and grouped data. So we'll do this problem all over again, in another way, but it's the same idea. This topic will be continued in the next lecture on descriptive statistics. Remember, the best way to learn this material is to do a lot of homework, find as many problems as you can, and practice, practice, practice.
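If you want to check your practice answers, here is a minimal Python sketch of two procedures from this part of the lecture: finding the mode (or modes), and the quick median-of-halves approximation for Q1 and Q3. The function names `modes` and `quick_quartiles` are made up for illustration. The sketch assumes the lecture's conventions: when no value repeats there is no mode, and when n is odd the median itself is left out of both halves, as in the absences problem. It is not the formula Excel uses, so expect slightly different quartiles from Excel.

```python
from collections import Counter
from statistics import median


def modes(values):
    """Return every value tied for the highest frequency.

    Lecture convention: if nothing repeats, there is no mode,
    so an empty list is returned.
    """
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(v for v, c in counts.items() if c == top)


def quick_quartiles(values):
    """Approximate (Q1, Q2, Q3) using the median-of-halves shortcut."""
    ordered = sorted(values)
    if len(ordered) < 2:
        raise ValueError("need at least two observations")
    half = len(ordered) // 2
    lower = ordered[:half]    # the values below the median position
    upper = ordered[-half:]   # the values above the median position
    return median(lower), median(ordered), median(upper)


# Mode examples taken from the lecture.
print(modes([1, 1, 1, 2, 3, 4, 5]))                # one mode
print(modes([5, 5, 5, 6, 8, 10, 10, 10]))          # bimodal
print(modes([1, 2, 3, 4, 5, 6, 7, 8, 9, 0]))       # no mode

# Quartile shortcut on the two median examples from the lecture.
print(quick_quartiles([10, 20, 30, 40, 50, 60]))
print(quick_quartiles([0, 2, 3, 5, 20, 99, 100]))
```

For the even-n example, the lower half is 10, 20, 30 and the upper half is 40, 50, 60, so the shortcut gives Q1 of 20 and Q3 of 50 around the median of 35. For the odd-n example, the median 5 is set aside and the halves are 0, 2, 3 and 20, 99, 100.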