So, we are talking about descriptive statistics, which is one branch of statistics. In the last lecture we covered measures of central tendency: the major function of descriptive statistics is to summarize the data, presenting it in a single value. We covered the mean, median, and mode, which describe how a large set of scores can be summarized into one average value. But wherever we talk about the mean, we also talk about the standard deviation or variance. The mean is good and gives a lot of information about the data, but we also want to know the variability or spread of scores, how the scores are spread out, rather than just a single central value. This function is fulfilled by measures of dispersion, also called measures of variability. They tell you about the dispersion or spread of scores in the data, and they are just as important and just as necessary as measures of central tendency like the mean.

Let's look at a few examples to see why measures of dispersion are important. On this slide I have put four sets of data. As I told you in my previous lecture, when data is symmetrical and smooth, the mean and the median are usually equal. In data one, these are the scores of five students on a test: 48, 49, 50, 51 and 52. If somebody asks me how students performed on the test, I'll report that the mean, or average, is 50. Similarly, in data two there are again five students, with scores 40, 45, 50, 55 and 60. Since the data is symmetrical, the mean and the median are both 50, so again I'll say the mean is 50. The same holds in data three, which is 30, 40, 50, 60, 70 with a mean of 50. And in data four the five students score anywhere from 0 to 100: 0, 25, 50, 75 and 100, and still the mean is 50. If I am describing how students performed on the test and I give only the single value, a mean of 50, that may not be enough. I also need to say something about the spread, dispersion, or variability of scores in the data. As you can see across these four data sets, in the first example the variability is very low: all students score between 48 and 52, right in the middle. In data two the variability increases, and it increases again in data three and in data four. So it is important to know something about the spread of scores as well, and I will tell you shortly how, why, and where it is so important to know about the variability and dispersion in the data.

In today's lecture we will cover measures of dispersion, or variability. There are four or five main measures in the textbooks I have already recommended: the range, the interquartile range, the mean deviation (which we will cover briefly), and then the variance and the standard deviation. Let's start with the range.
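To make that concrete, here is a minimal Python sketch using the four data sets from the slide; the population standard deviation (`statistics.pstdev`) is used only as a preview of the spread measures covered later in this lecture:

```python
import statistics

datasets = {
    "data 1": [48, 49, 50, 51, 52],
    "data 2": [40, 45, 50, 55, 60],
    "data 3": [30, 40, 50, 60, 70],
    "data 4": [0, 25, 50, 75, 100],
}

for name, scores in datasets.items():
    mean = statistics.mean(scores)        # central tendency: 50 for all four sets
    spread = statistics.pstdev(scores)    # dispersion: grows from data 1 to data 4
    print(f"{name}: mean = {mean}, standard deviation = {spread:.2f}")
```

All four sets report the same mean of 50, while the spread value grows steadily, which is exactly why a single central value is not enough.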
Range: as we talked about in our previous lecture, when we were constructing the histogram and polygon we decided the class intervals and class boundaries by looking at the range. The range is a rough measure; it gives a rough estimate of the spread of scores in the data. It is very simple, even crude: you take the highest value in the data and the lowest value in the data, and you subtract the lowest from the highest. So the range is usually maximum minus minimum. For example, in the first data set, where the mean was 50, the highest score was 52 and the lowest score was 48, so the range is 52 minus 48, which is 4. And you can see the range grow across the data sets: in the last data set the minimum is 0 and the maximum is 100, so the range is 100. It keeps increasing.

Mostly the range is maximum minus minimum, but you have seen that in some formulas in the book it is maximum minus minimum plus one. Why? Because when we have discrete data, or an inclusive measure, we include both ends, the lowest and the highest score. For example, as on the slide, suppose I want to see the number of children in a family, and the data have 1, 2, 3, 4, 5. There are families with one child and there are families with five children, so 1 and 5 are both inclusive; it is discrete data. So the range is 5 minus 1, which is 4, plus 1, which is 5, because the families with one child and the families with five children are both counted. In cases like this, where both limits are inclusive and you are dealing with whole numbers, discrete data, the range is highest minus lowest plus one.

What are the advantages and disadvantages of using the range? The range is a quick estimate if somebody wants to know about the dispersion of scores in the data. For example, if I have given an exam and somebody asks about the performance of the class, a quick, rough estimate is to look at the highest score and the lowest score and see how students are scoring on the test. So it is quick and it gives a rough estimate of the spread. But of course there are disadvantages as well, because it focuses only on the extreme values and ignores all the middle values; it takes just the maximum and the minimum. As we talked about earlier, if my data are 1, 2, 3, 4 and 100, the range is 100 minus 1, but if I look at the data, almost all the scores are actually below five. So the disadvantage is that it ignores everything in between and takes only the two extreme values; just as the mean is affected by extreme scores, so is the range. A short sketch of both versions of the range follows below.
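Here is a minimal sketch of both versions of the range; the function and variable names are mine, just for illustration:

```python
def exclusive_range(scores):
    """Ordinary range: maximum minus minimum."""
    return max(scores) - min(scores)

def inclusive_range(scores):
    """Range for discrete, whole-number data where both end values are included."""
    return max(scores) - min(scores) + 1

test_scores = [48, 49, 50, 51, 52]    # data set 1 from the slide
children = [1, 2, 3, 4, 5]            # number of children per family

print(exclusive_range(test_scores))   # 4
print(inclusive_range(children))      # 5
```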
To overcome this, because the range ignores the middle scores, we have another measure called the interquartile range. The interquartile range also tells you about the spread of the scores, but it focuses on the middle 50% of the cases. Just as with the median, where we take the central value and ignore the extreme values at the ends, here too, for looking at the dispersion or variability in the data, we focus only on the middle 50% of the cases. The formula is simple: calculate Q3, calculate Q1, and then subtract, IQR = Q3 − Q1. So it gives you the variability within the middle 50% of the cases and ignores the lower and the upper ends of the data.

Here is how we calculate it on ungrouped data. This is just rough, very smooth, symmetrical data with straight numbers, only to give you an idea of how we calculate the interquartile range. Suppose I have the data 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 and 11. I calculate Q3, which is the 75th percentile, and it is 8.5, and Q1, which is the 25th percentile, and it is 3.5. I take these two values and subtract them, and I get that the dispersion or variability within the middle 50% is about 5. A quick sketch of this calculation follows below.
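Here is a minimal Python sketch of the ungrouped calculation; note that different textbooks use slightly different quartile conventions, and the "inclusive" method below is simply one that reproduces the 3.5 and 8.5 values quoted above:

```python
import statistics

data = list(range(1, 12))                 # 1, 2, ..., 11

# quantiles() with n=4 returns the three quartile cut points [Q1, Q2, Q3]
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

iqr = q3 - q1
print(q1, q3, iqr)                        # 3.5 8.5 5.0
```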
We can do this on grouped data as well. Remember the formula we used when we calculated the median and other percentiles: percentile = L + (i/f) × (position − CF), where L is the lower limit of the class in which the percentile falls, i is the class interval, f is the frequency of that class, the position is p/100 × N, and CF is the cumulative frequency of the class below it. For Q1 we calculate the 25th percentile and for Q3 the 75th percentile, and their difference is the interquartile range.

Let's solve a quick example, which will also refresh what we did previously with the median and percentile ranks. I'll take very simple data so we can calculate quickly without calculators. Suppose the scores are 7, 6, 5, 4 and 3, and since it is grouped data each has a frequency: 3, 3, 4, 3, 3. To calculate any percentile, median, or quartile you have to have the column of cumulative frequencies. Since the data is in descending order, we start adding from the bottom: 3, then 3 + 3 = 6, then 6 + 4 = 10, then 10 + 3 = 13, and then 13 + 3 = 16. So N, the total of the frequency column, is 16.

First I calculate the position of Q3, the 75th percentile: 75/100 × 16 = 12. Then the position of Q1, the 25th percentile: 25/100 × 16 = 4. Remember, these are positions in the data, not scores; we still have to find the score lying at the 4th position, which is Q1, and the score lying at the 12th position, which is Q3, by plugging the values into the formula.

For Q3 we identify which class contains the 12th case by looking at the cumulative frequency column: 12 falls in the class for the score 6, whose cumulative frequency is 13. We take its lower limit, which is 6, the class interval i, which is 1, and the frequency of this class, which is 3. The cumulative frequency below it is 10. So Q3 = 6 + (1/3) × (12 − 10) = 6 + 2/3, which is about 6.67.

Similarly for Q1: the 4th case falls in the class for the score 4, whose cumulative frequency is 6. Its lower limit is 4, i is 1, the frequency is 3, and the cumulative frequency below is 3. So Q1 = 4 + (1/3) × (4 − 3) = 4 + 1/3, which is about 4.33. Now the interquartile range is Q3 − Q1 = 6.67 − 4.33, which is roughly 2.3. So this is how the interquartile range tells you about the spread of scores within the middle 50% of the cases in your data. A short sketch of the same grouped-data calculation follows below.
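Here is a minimal sketch of that grouped-data formula; `grouped_percentile` is a name of my own, and the lower limits follow the lecture's choice of using the score values themselves (6 and 4) rather than exact class boundaries:

```python
def grouped_percentile(position, classes):
    """
    classes: list of (lower_limit, interval, frequency, cumulative_frequency),
    ordered from the lowest class to the highest.
    Returns L + (i / f) * (position - CF_below), as in the lecture formula.
    """
    cf_below = 0
    for lower, interval, freq, cf in classes:
        if position <= cf:
            return lower + (interval / freq) * (position - cf_below)
        cf_below = cf
    raise ValueError("position exceeds total frequency")

# scores 3, 4, 5, 6, 7 with frequencies 3, 3, 4, 3, 3 (N = 16)
classes = [(3, 1, 3, 3), (4, 1, 3, 6), (5, 1, 4, 10), (6, 1, 3, 13), (7, 1, 3, 16)]

n = 16
q1 = grouped_percentile(25 / 100 * n, classes)   # 4 + (1/3)*(4 - 3)  ≈ 4.33
q3 = grouped_percentile(75 / 100 * n, classes)   # 6 + (1/3)*(12 - 10) ≈ 6.67
print(q1, q3, q3 - q1)                           # ≈ 4.33 6.67 2.33
```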
The interquartile range has its own advantages and disadvantages. As we discussed, the limitation of the range was that it ignored the middle data and focused only on the two extreme values, and the interquartile range answers that by focusing on the middle 50% of cases. The benefits are that it describes the variability of the middle cases, where most of the scores are concentrated; it is easy to calculate, especially when your data is smooth and Q3 and Q1 are easy to identify, as in the ungrouped example; and it eliminates the influence of extreme scores, which was the problem with the range. The drawback is that you miss a lot of information: whatever lies in the bottom 25% and the top 25% of cases is thrown away, and we still really want to know the variability in the whole data, how all the scores are spread out, without discarding either the extreme values or everything outside the middle 50% of cases.

So we also have the mean deviation. For the mean deviation we take every score minus the mean: we take each x value into account and see how far it deviates, its distance from the mean, and then we add up the total. But the problem is that whenever you subtract the mean from every score in a data set, that summation will always be zero, as I have shown in the example on the slide. The mean is your central, balancing point, so when you subtract it from every score in the data, the positive and negative deviations cancel out and the sum is zero. That is why the mean deviation in this raw form is not used very much; a small sketch of the cancellation follows below. In the next part we will talk about the best measures for describing the variability and dispersion in the data.
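As a small illustration of that cancellation, here is a minimal Python sketch using data set 2 from the first slide; the absolute-value line is just one common way around the problem, anticipating the measures discussed next:

```python
import statistics

scores = [40, 45, 50, 55, 60]              # data set 2 from the first slide
mean = statistics.mean(scores)

deviations = [x - mean for x in scores]
print(sum(deviations))                     # 0: positive and negative deviations cancel

# taking absolute values is one way around this cancellation
print(statistics.mean([abs(d) for d in deviations]))   # 6: average distance from the mean
```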