Welcome to this continuation of our lecture on descriptive statistics. In this part of the descriptive statistics module, we're going to look at ways to work with data that comes as frequencies: grouped data and categorical data, getting summary measures, and looking at graphical approaches.
Let's start by taking a quick look at a problem we saw at the beginning of our lecture on descriptive statistics, when we were still focused on measures of location. We looked at this example of reading level. A random sample of 16 eighth graders in a particular school was tested, and they showed reading levels from five, meaning fifth grade, through ninth grade and tenth grade. So some of them were above and some of them were below their grade level. The solution we found then is listed for you here. I'm not going to belabor it; we did it already. You order the data and you get the location statistics: the measures of central location and the measures of non-central location.
Now, the previous problem can actually be set up somewhat differently, and you're going to like it better this way: as a frequency distribution. Notice you have a five, a six, a seven, an eight, a nine and a ten, and now we show the frequencies. The five showed up twice. The six showed up three times. The seven showed up three times. The eight showed up twice. The nine showed up five times and the ten showed up once. You can see why we call it a frequency distribution. It's also called grouped data. We're grouping the data to make it a lot easier to work with, especially when you have a huge data set. Here we only have 16 observations, which is why you can still work with it ungrouped; it's only 16 separate numbers. But with a big data set, you want to group it.
So we've grouped it now. We're only looking at six different values, reading levels from five all the way up to ten. The important thing to recognize is the sum of the frequencies.
Add up all the frequencies and you've got n. So we sum them: two plus three plus three plus two plus five plus one is 16. That's n. We have 16 numbers. So n is the sum of the frequencies.
Now the mode, you can see it immediately. The mode is the value that occurred most often. Which value came up most often? Notice that the highest frequency is five, and it belongs to the nine. The nine showed up five times, so nine is the mode. And remember that all your measurements, the x's, are the values five, six, seven, eight, nine and ten. So your mean has to be between five and ten. Same thing for the median: it will also be between five and ten. Your mode will be one of those values, and we already figured out that the mode is nine, because that had the highest frequency.
Now the median has to be somewhere in the middle. The data is already grouped and ordered for you. Since n is 16, think of the point that cuts the data so that you have eight values above and eight below. Count down eight numbers using the frequencies: two, three, three. Notice that the eighth value, the last of the first eight, is a seven. And if you go backwards from the other end and count one, five, two in the frequencies, you'll see that the eighth value from that side is an eight. You have eight above and eight below, so you go right in the middle between seven and eight. We can get the median just by looking at the table. It's right between the seven and the eight, which is seven and a half. So your median is 7.5. That makes things a lot easier when you set it up this way.
Okay, let's compute the mean for grouped data. The formula is right there on the bottom: the sample mean is the sum of each xi times its frequency fi, divided by n. In other words, what you have to do now is make a new column called xi fi.
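As a sketch of the formula just described, it can be written as follows. The symbol k, for the number of distinct values in the table, is a label assumed here for illustration; the transcript does not show which symbols the slide itself uses.

```latex
\bar{x} = \frac{\sum_{i=1}^{k} x_i f_i}{n},
\qquad \text{where } n = \sum_{i=1}^{k} f_i
```

Here each x_i is one of the distinct values in the table and f_i is how many times it occurred, so the numerator is the total of the xi fi column.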
So we have five times two, which is essentially the same as saying I had two fives and I'm adding them up: five plus five. Instead of doing five plus five, I say two fives is five times two, which is 10. There were three sixes, which is another way of saying six plus six plus six; that's 18. Seven times three, the three sevens, adds up to 21: seven plus seven plus seven. Eight times two is 16. Five nines, nine plus nine plus nine plus nine plus nine, is 45. And 10 times one is 10. You add up that last column, the sum of the xi fi, and you get 120. That's the same as you would get if you ungrouped the data and added up the 16 numbers you started with, but you don't have to do it that way. Take the sum of the xi fi column, which is 120, and divide by the sum of the frequencies, n, which is 16. 120 over 16 is 7.5, and 7.5 is the sample mean. That's your average reading level.
Let's look at this next problem. Here we have a thousand values. The sum of the frequencies is a thousand, which means n is a thousand. Again, you're not going to want to write this ungrouped, because you'd have to write out a thousand numbers. So the first thing you note is that n is a thousand. Then you do the third column, the xi fi column. Zero times 10 is zero. One times 20 is 20. Two times 30 is 60. Three times 40 is 120. And eventually you get to 10 times 100, which is a thousand. You add up the last column and get 5,960 absences. That's the total number of absences for the thousand people working in the company. So if you want the mean, all you have to do is divide 5,960 by a thousand, and the mean is 5.96 absences.
Now let's get the median, and then we'll do Q1 and Q3. The median is the middle value when you order the data, and the data is already ordered for us: it goes from the lowest value, zero, to the highest, 10. We have a thousand observations, so the middle value is going to be around the 500 mark.
So think about going down 500 in the frequency column. We have 10; that's our first frequency, for the zero. Plus 20 is 30. We're not yet at 500. Another 30 is 60. By the time we finish the threes, we've used up 100 values. We go to the fours now and have another 50, so 150 values have been used up. Now, with the fives, you can't go past them: their frequency is 400, and 400 plus 150 is 550. We want to get to the 500th value, so it's somewhere in the fives. That's going to be the median. So the median is five absences. As a check, you can go in the other direction. Go backwards and count up 500. 100 plus 100 is 200. Plus 60 and 40 is 300. Plus 150, so now we have 450 values. Going up, we have to go another 50, and guess what? We're again in the fives. So that's two ways of getting the median: you count down, or you count up, and either way you land on the median.
Now the mode, you can just look at the table and find it. The mode is five, because that has the highest frequency, 400. So I've shown you how to get the median and how to get the mode, and we did the mean already.
Now let's try to get Q1 and Q3. You do the same thing, but for Q1 you count down 250 observations, because a quarter of a thousand is 250. Again, use only the frequency column. 10 plus 20 is 30. 30 and 30 is 60. 60 and 40 is 100. Plus 50 is 150. Now you want 250, so you're going to be somewhere in those fives. One of those fives is your Q1. So Q1 is five absences. To get Q3, do the same thing from the other direction: count 250 observations starting from the bottom of the table. Starting with the tens, you get 100, and 100 more is 200. If you go past the eights, you'll have 260, so the 250th value is somewhere in those eights. One of those eights is Q3. So now you've found that Q1 is five absences and Q3 is eight absences.
Again, to get the median, you just take your n, the sum of the frequencies, and cut it in half. In this case half of it was 500.
Then you can go up or down the frequency table, and you'll get the median. If you want Q1, you take a quarter of n, which here is 250. Starting from the lowest number, count 250 and you get to Q1. Starting from the bottom of the table and going backwards, upwards, you'll get Q3 after you count 250. And the interquartile range, which we're going to learn about, is the difference between Q3 and Q1. Here that's just eight absences minus five absences, or three absences.
Here's another problem with grouped data. We're looking at defects and the frequency of these defects. We're an automobile manufacturer, and we see that 10 cars had one defect, 10 cars had two defects, all the way up to 20 cars that had 10 defects. If you look at the sum of the frequencies, you'll note right away that the sample size was 350 cars. Just add up the frequencies, 10 plus 10 plus 20 plus 20, all the way up to the last 20, and that's 350. So the sum of the frequencies, n, is 350. The total number of defects comes from adding the last column, the sum of the xi fi: 10 times 1 is 10, 10 times 2 is 20, 20 times 3 is 60, 20 times 4 is 80, 40 times 5 is 200, until you get to the last one, 20 times 10 is 200. Now we know that the total number of defects was 2,290. The mean is just 2,290 over 350, so the average number of defects is 6.54, if you want to go to two decimal places. Always remember that the sum of the frequencies is n, the sample size.
Now the median. Remember we have n of 350, those 350 cars, so for the median we take half of 350, which is 175. The data is already ordered, so let's go down 175 using only the frequencies column. 10 plus 10 plus 20 is 40, plus 20 is 60, plus 40 is 100, plus 50 is 150. Now somewhere in the sevens we're going to reach 175.
So we know that somewhere in the sevens, that's seven defects, is going to be our median. If you want to check your answer, go backwards 175. Starting now from the bottom, 20 plus 50 is 70, plus 60 is 130. We just go up a little bit more and, again, somewhere in those sevens is our median. So the median is 7 defects.
You want to get Q1 and Q3? Well, you've got to take a quarter of 350, and that's 87.5, so around 88. Let's go down 87.5 and that will get us to Q1. Then we'll go in the other direction, going upwards, and do the same thing to get Q3. Going down 87.5: 10 plus 10 is 20, 20 and 20 is 40, then we have 60, and notice that somewhere in the fives we're going to have the 87.5th observation. So Q1 is 5 defects. Do the same thing going backwards. Let's start from the bottom. We want to go up 87.5. Now 20 plus 50 is 70, so it's in the eights. I can see right away that you can't go beyond the eights, because then you'll have 130. So somewhere in the eights is Q3. So Q1 is 5 defects, Q3 is 8 defects, and the interquartile range, Q3 minus Q1, is 3 defects.
So far we have seen discrete grouped data that has been collected and organized into a frequency distribution, where you have the number of observations that fell into each class. What can we use this for? Well, we've already seen a frequency distribution used for numerical data with repeated observations, which is another way of saying discrete grouped data. We could also use it for any quantitative data that has been grouped, even if the groups didn't come naturally, like collecting income data and organizing it by tens of thousands, let's say. And, for the same reason, it can also be used for categorical data. In fact, we don't really have many other options for categorical data besides frequencies and percentages, where you can have a frequency distribution. We'll see more about that soon.
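The counting procedure used in the examples above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code. The function name `grouped_summary` and its helper `kth` are made up for this sketch. Rounding a quarter of n up to the next whole observation (87.5 becomes 88) is an assumption about how to handle a fractional position, and other quartile conventions exist. The reading-level frequencies are as stated in the lecture. In the defects table, the frequency of 70 for seven defects is not stated outright and is inferred from the stated totals of 350 cars and 2,290 defects.

```python
from math import ceil

def grouped_summary(values, freqs):
    """Summary measures from a discrete frequency table (values in ascending order)."""
    n = sum(freqs)                                      # n is the sum of the frequencies
    total = sum(x * f for x, f in zip(values, freqs))   # sum of the xi fi column
    mode = values[freqs.index(max(freqs))]              # value with the highest frequency

    def kth(k, pairs):
        # Walk the frequency column until k observations have been used up.
        running = 0
        for x, f in pairs:
            running += f
            if running >= k:
                return x
        raise ValueError("k is larger than n")

    low_to_high = list(zip(values, freqs))
    high_to_low = low_to_high[::-1]

    if n % 2 == 0:
        # Even n: go halfway between the two middle observations.
        median = (kth(n // 2, low_to_high) + kth(n // 2 + 1, low_to_high)) / 2
    else:
        median = kth((n + 1) // 2, low_to_high)

    quarter = ceil(n / 4)              # assumption: round a fractional position up
    q1 = kth(quarter, low_to_high)     # count down from the lowest value
    q3 = kth(quarter, high_to_low)     # count up from the highest value
    return {"n": n, "total": total, "mean": total / n, "median": median,
            "mode": mode, "q1": q1, "q3": q3, "iqr": q3 - q1}

# Reading levels: frequencies as stated in the lecture.
reading = grouped_summary([5, 6, 7, 8, 9, 10], [2, 3, 3, 2, 5, 1])

# Defects per car: the 70 for seven defects is inferred from the stated totals.
defects = grouped_summary(list(range(1, 11)),
                          [10, 10, 20, 20, 40, 50, 70, 60, 50, 20])

print(reading)
print(defects)
```

Under these assumptions the reading levels come out with a mean of 7.5, a median of 7.5 and a mode of 9, and the defects come out with a mean of about 6.54, a median of 7, Q1 of 5 and Q3 of 8, matching the hand counts.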
Here's a quick example to show what we can do with frequency distributions other than just get the summary statistics like we did before. In this problem, a sample of 200 professors was taken, and we asked each for their take-home weekly salary. The responses ranged from $520 a week to $590 a week, and remember the sample size was 200. In this case we have numeric data, quantitative data. Strictly speaking, we did not need to use a classification technique, but sometimes you do because it gives you more insight into your data. Here we collapsed the data by creating intervals that didn't really come naturally to the data. We just had to decide. I guess we decided here that we wanted seven categories. The range was $520 to $590; that's $70, and $70 divided by 7 is $10 per category. You can see the categories: for $520 and under $530 there were six, so we don't have the individual values, but we don't need them in this case. Six out of 200 is 0.03, so that's 3% of the total. That goes all the way through the seven categories. You have the intervals, the frequency and the percentage. That's the frequency and percentage distribution in the top chart.
Then we took this and created a cumulative distribution from it. You've seen this before; you know what these look like. Basically, as you go along, you don't look at individual categories; you keep adding each category to the ones that came before. The first category in our classification was $520 and under $530, and there were six of them, so less than $530 there were six observations. Less than $520 there were zero observations. Less than $540, how many were there? Well, the six from under $530 and the 30 from $530 to under $540, so altogether 36. And for each of those there's a percentage too.
That's the cumulative percentage distribution. When you get all the way to the bottom, the final category, since you're accumulating everything, has to be equal to the sample size. So all 200 observations were below $590, which we knew to start out, and that's 100% of the distribution.
Over the next few slides we're going to look at some graphical approaches to grouped data: taking frequency data or percentage data and seeing what the graph would look like. A chart, a graph, a bar chart; there are other ways of doing it, and we're just looking at a selection here.
You can create a histogram, a kind of bar chart, from the frequency distribution. In this case it's a vertical bar chart. Across the x axis are the categories, the intervals. They're discrete, non-overlapping intervals, and that's why it looks like bars. We're not plotting the raw quantitative data; the data has been collapsed into intervals, so you don't see a smooth continuous curve. The height, on the y axis, is the frequency, and you see that sometimes the individual frequency for each interval is written on top of each bar for clarity.
A frequency polygon is another way of portraying the frequency histogram. It's not a curve, although it looks like one. Imagine it overlaid on the histogram: it's constructed simply by putting a point at the top center of each bar and then connecting the dots, so you can see how that was done.
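The running totals behind a cumulative distribution, and the x positions where a frequency polygon places its points, can be sketched as follows. This is only an illustration under stated assumptions. The helper names `cumulative` and `class_midpoints` are made up for this sketch. Because the transcript gives only the first two salary frequencies (6 and 30), the cumulative example reuses the reading-level frequencies from earlier in the lecture rather than inventing the missing salary counts. The midpoints use the seven $10-wide salary classes from $520 to $590.

```python
def cumulative(freqs):
    """Running totals and running percentages of a frequency column."""
    n = sum(freqs)
    running = 0
    cum_counts, cum_pcts = [], []
    for f in freqs:
        running += f                           # add this class to everything before it
        cum_counts.append(running)
        cum_pcts.append(100 * running / n)     # share of the whole sample so far
    return cum_counts, cum_pcts

def class_midpoints(low, high, width):
    """Center of each class interval: where a frequency polygon puts its points."""
    return [start + width / 2 for start in range(low, high, width)]

# Reading-level frequencies for the values 5 through 10, as stated in the lecture.
cum_counts, cum_pcts = cumulative([2, 3, 3, 2, 5, 1])

# Seven salary classes of width $10, from $520 up to $590.
midpoints = class_midpoints(520, 590, 10)

print(cum_counts)
print(cum_pcts)
print(midpoints)
```

The last cumulative count always equals n and the last cumulative percentage is always 100, which is the check described above for the bottom row of the cumulative table.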
Here's a graph of the cumulative frequency distribution for this data. You saw the cumulative frequency distribution on a previous slide. This does look like a smoother curve, even though it's not based on the raw quantitative data but on data that was classified into categories. The interesting thing about this, and you'll see more of it later in this course and in other courses, is the S-shaped curve. It's typical of continuous distributions that have most of their mass in the center, are more or less symmetric, and go down towards zero at the extremes on either side. You can see how at the left side it starts out slow, where the smaller frequencies were; then in the middle, where the mass of the distribution is, it jumps, because that's where the higher frequencies are; and then it peters out again at the right, where again the smaller frequencies are. So it's a typical S-shaped curve for certain very frequently used cumulative distributions, including one that we're going to study a great deal and be involved with for most of the semester, and that's the normal distribution.
This descriptive topic, even though it's at the beginning of the course and relatively easy, ended up being a bit longer altogether than you might expect. There's a lot in there. It's descriptive statistics, and a large part of what we do is descriptive statistics.
Don't forget, a little bit of a shameless reminder here: you must do your homework. Whether you're in this course to learn or you're in this course to do well on your exams, the answer is the same. You have to find as many problems as you possibly can and practice, practice, practice. And what comes out is a happy, smiling student.