As-Salam-Alaikum. Welcome to lecture number 10 of the course on statistics and probability. You will recall that in the last lecture, I was discussing with you the concept of central tendency. In particular, we discussed the geometric mean and the harmonic mean, and after that we discussed a relationship that exists between the arithmetic mean, the geometric mean and the harmonic mean. Also, towards the end of the last lecture, I conveyed to you two other measures of central tendency, the mid-range and the mid-quartile range. Today, I am going to begin another extremely important concept with you, and that is the concept of dispersion. Dispersion means the variability that exists in your data set. As you know, in any phenomenon, when you collect data, all the values are not the same but vary from one another. The variability that exists between all these values is a very important concept, and it is very important for us to have some way of measuring the amount of variability that is present in our data set. Let me explain this point to you with the help of an example. In a technical college, it may well be the case that the ages of a group of first-year students are quite consistent. For example, 17, 18, 18, 19, 18 and so on. They may all be more or less of the same age. On the contrary, a class of evening students undertaking a course of study in their spare time may show just the opposite situation as far as the ages of these students are concerned. For example, their ages may be 35, 23, 19, 48, 32 and so on. I am sure that from this example you can clearly see that in the first class there is very little variability with regard to age, while in the second class there is a lot of variability. This is the topic of today's lecture, and it is a concept which is as important as the concept of average.
It is a curious fact that a layman is very readily aware of the concept of average: he has an intuitive idea in his mind that if there is a set of data, it will have an average value. But the accompanying concept, the variability in that data, is generally not given the same importance by a layman; he does not appreciate its significance as much as he does that of the average. I would like to convey to you that this concept is as important as the concept of average. After all, unless there is some variability in your data set, what is it that you are trying to average? You take an average only because there is variability in your data. Let me try to explain this to you with the help of another example. Suppose that the sizes of the classes in two comprehensive schools in two different areas are as follows. As you can see in this table, the first column represents the number of pupils in a class, in other words the class size, and these class sizes are 10 to 14, 15 to 19, 20 to 24, 25 to 29 and so on. It goes up to quite a large class size, and that is 45 to 49. But if you now compare the two areas, the two localities, you can see that in area A there is not a single class with a size of 10 to 14, or 40 to 44, or 45 to 49. All the class sizes in this school range between 15 to 19 and 35 to 39. On the contrary, in area B the situation is quite different, and we have 5 classes in the school whose class size is only 10 to 14. And at the other end of the table, we have 3 classes with class size 45 to 49 and 3 classes with class size 40 to 44. If we compute the arithmetic mean for both these areas, we find that the average class size for both areas is identical, and it is 27.33. But students, as I have just explained to you,
from the point of view of variability, the two areas are quite different, and if we plot these two frequency distributions on the same graph, you will realize that the two distributions are absolutely different. As you now see on the screen, the spread of the distribution for area A is much less than the spread of the distribution for area B. But as I just mentioned, the average value, that is the arithmetic mean, is identical for both distributions, and that is 27.33. So it is now clear that these distributions are not identical, in spite of the fact that the arithmetic mean is the same. The crucial question now is: how do we distinguish between the two distributions? Graphically, of course, we can see that they are different. But it is not enough for me to just say that it is obvious from the graph that they are different. We have to have some numerical measure of the variability, the spread, the scatter of the distribution of locality A as well as of locality B. And the moment we have some proper numerical way of measuring these quantities, then of course we are able to compare the two distributions in a proper way. As you saw in the case of central tendency, there were quite a few ways of measuring the central tendency of a data set. You remember all those things that we have discussed: the arithmetic mean, the geometric mean, the harmonic mean, the median and the mode. Similarly, students, we have different ways of measuring the spread of our distribution. I will discuss with you four very basic and important measures of dispersion. These are the range, the quartile deviation, the mean deviation and, the most important and the most widely used, the standard deviation. But before we pick these up one by one, students, I would like to convey to you another very important point, and that is that the four measures I have just named are all called absolute measures of dispersion.
By absolute we mean that for all these measures, your answer is expressed in the same units as your data set. Let me say this again. An absolute measure of dispersion is one that measures the dispersion in terms of the same units as the units of the data, or in the square of those units. For example, if the units of the data are rupees, meters, kilograms, etc., the units of the measures of dispersion will also be rupees, meters, kilograms, etc. In contrast, a relative measure of dispersion is one which expresses the absolute measure of dispersion relative to the relevant average, and it is often multiplied by 100. In this way it is a pure number, independent of the units in which the data has been expressed. A relative measure of dispersion is one that is expressed in the form of a ratio, coefficient or percentage, and as such it is independent of the units of measurement. The advantage of computing a relative measure of dispersion is that, because it is a pure number, it can be used for purposes of comparison. You are able to compare the dispersion of one data set with the dispersion of another. Let us now begin the discussion of the various measures of dispersion one by one. I start with the simplest, and that is the range. The range is defined as the difference between the two extreme values of a data set, that is, R = x_m - x_0, where x_m represents the largest value and x_0 the smallest. You will remember that when you first learnt how to construct a frequency distribution from raw data, this was one of the very first steps that you did. You found the range and then you divided it by the number of classes that you wanted to have. At this point I am mentioning this quantity with reference to the dispersion, the scatter, the spread of the data set, and you will agree that there cannot be a simpler measure than this to determine the spread of your data.
If you look at it graphically, as you can see on the screen, it is the distance between the smallest value, which lies at the left end of your distribution, and the highest value, which lies on the right side. I hope that from this diagram it is obvious to you that for any distribution which has a greater variability between the extreme values, this distance will obviously be longer, and for a tighter distribution, where there is not that much difference between the two extreme values, the range will obviously be a smaller quantity. As is obvious, the range is the easiest measure of dispersion. However, students, you must realize that it has two serious disadvantages. The first point is that it is based on only the two extreme values, and as such it ignores all the information that is present in the intermediate values. The flaw in this is that if we are trying to measure the variability of the data set, then in principle all the values should be utilized to compute this variability. Why is it that we are ignoring all the information that is in the values which lie inside? The second point is that, because of this very fact that it uses only the two extreme values, the range can sometimes be quite misleading. For example, suppose that a professor has conducted a test which was very difficult, and most of the students got very low marks, something like 2, 3, 7, 4, 5 out of 20. But there is just one student who is very intelligent, and he is able to get 18 out of 20 in this very test. If you want to measure the variability of the marks by way of the range, what will you obtain? The highest mark is 18, the lowest mark is 2, and the range comes out to be 16.
But you should realize, students, that this number 16 is not a very good representative of the variability that is present in this data set, in all those marks which were quite low except for this one lone mark which was so high. If just this one number were not present in the data set, the range of the remaining numbers (7 minus 2, that is 5) would be much smaller than 16. In this way, if you have an extreme value in your data set, the range is absolutely inappropriate as a measure of dispersion. Of course, there are situations where the range is appropriately used as a measure of dispersion: for example, in the case of statistical quality control charts, in the case of stock prices and also in the case of daily temperature. Every day we see and hear on television that today's minimum temperature was this much and the maximum temperature was this much. Now I am going to talk to you about a technical point. The way I have described the range is very simple, and of course whenever you apply it, you will compute it in exactly this simple way. But the concept that I wish to convey to you now is that, generally speaking, for any measure of dispersion we would like to think of it as the spread of the values around a central value. That is, we first take an average in mind, and then we want to measure the spread around that average: with reference to this average, is there a lot of spread, or is it only a little bit? So, let me now present to you the concept of the range with regard to an average that is relevant to the range. You will recall that at the end of the last lecture we mentioned the mid-range, which was defined as (x_0 + x_m) over 2.
Now the range can be defined as twice the arithmetic mean of the deviations of the smallest and largest values around the mid-range. That is, the range is equal to the sum of (mid-range minus x_0) and (x_m minus mid-range), divided by 2, with this whole quantity multiplied by 2. Solving this expression, it comes out to be x_m minus x_0, exactly the same definition which I gave you earlier. So the point is that there is the mid-range; there is its distance from x_0, which we call a deviation, and its distance from x_m, which we also call a deviation; and if we take the arithmetic mean of these two deviations and double it, that is equal to our range. So why was it necessary to go into all this complication? Only to convey to you that the range also follows that basic concept, that generally we are trying to measure the dispersion of our data set around a central value. So that was the range, and as I said earlier, this is an absolute measure of dispersion. The range is expressed in the same units as your original data. Now, what is the corresponding relative measure of dispersion? As you now see on the screen, the relative measure of dispersion relevant to the range is called the coefficient of dispersion, and it is given by the half-range divided by the mid-range, in other words (x_m - x_0) over 2 divided by (x_m + x_0) over 2, which on simplification becomes simply (x_m - x_0) over (x_m + x_0). As I said earlier, this relative measure of dispersion is a pure number, and therefore it can be used for comparing the dispersion of two different data sets. For example, if the coefficient of dispersion of one data set comes out to be 0.6 and the coefficient of dispersion of another is 0.4, then it should be obvious that the spread of the first one is greater than the spread of the second one, bearing in mind that we have measured both relative to the central point.
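To make the range, the mid-range identity and the coefficient of dispersion concrete, here is a minimal Python sketch. It reuses the marks mentioned in the difficult-test example (2, 3, 7, 4, 5 and the lone 18); the function names are illustrative choices, not something given in the lecture.

```python
def value_range(values):
    """Range R = x_m - x_0 (largest value minus smallest value)."""
    return max(values) - min(values)


def mid_range(values):
    """Mid-range = (x_0 + x_m) / 2."""
    return (min(values) + max(values)) / 2


def coefficient_of_dispersion(values):
    """Half-range over mid-range, which simplifies to (x_m - x_0) / (x_m + x_0)."""
    return (max(values) - min(values)) / (max(values) + min(values))


# Marks out of 20 from the difficult-test example, including the one high mark.
marks = [2, 3, 7, 4, 5, 18]
x0, xm = min(marks), max(marks)

r = value_range(marks)                        # 18 - 2
mr = mid_range(marks)                         # (2 + 18) / 2

# Twice the arithmetic mean of the two deviations around the mid-range.
twice_mean_deviation = 2 * ((mr - x0) + (xm - mr)) / 2

# Dropping the single extreme mark shows how sensitive the range is.
r_without_extreme = value_range(marks[:-1])   # 7 - 2

cd = coefficient_of_dispersion(marks)
# Changing the units (here, multiplying every mark by 100) leaves
# the relative measure unchanged, because it is a pure number.
cd_rescaled = coefficient_of_dispersion([100 * m for m in marks])

print(r, twice_mean_deviation, r_without_extreme, cd, cd_rescaled)
```

The range falls from 16 to 5 once the extreme mark is removed, while the coefficient of dispersion is the same before and after the change of units.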
The next concept of dispersion that I am going to discuss with you is the quartile deviation, and it is also known as the semi-interquartile range. As you now see on the screen, the quartile deviation is defined as half of the difference between the third and the first quartiles, that is, the quartile deviation is equal to (Q3 - Q1) divided by 2. Let us now consider the graphical picture of this measure. You will recall that Q1 is that quantity which has 25 percent of the data before it and 75 percent after it. Similarly, Q3 is that quantity which has 75 percent of the area, or data, before it and 25 percent after it. So, as you now see on the screen, the quartile deviation, or in other words the semi-interquartile range, is the horizontal distance which is exactly half of the distance between the first quartile and the third quartile. So you have seen that the quartile deviation too is expressed as a horizontal length. Why is it that in both cases, the case of the range and the case of the quartile deviation, we are having this distance in a horizontal direction? Well, students, I hope that the answer is obvious to you. In the frequency distribution that you have plotted, the y axis denotes the frequencies, that is, how frequently a certain x value occurred. But the x values themselves, which are our variable of interest, our data values, are in any case represented on the x axis. Hence, if we want to measure the variation among them, it is quite clear that that variation is going to be along the x axis. This is why, when I later discuss the mean deviation and the standard deviation, even those quantities will be expressed as horizontal distances below the x axis.
A little while ago, when I was talking about the range, I clearly conveyed to you that if our data set contains an extremely large or extremely small value compared with the rest of the values, then the range is not at all suitable as a measure of spread. Now, students, in the case of the quartile deviation you will note that this problem of the range has been overcome. The distance now is the distance between Q1 and Q3, which we then halve. Q1 is not at the extreme left of the distribution and Q3 is not at the extreme right: 25 percent of the data lies beyond Q3, and 25 percent of the data lies before Q1. Hence there is no problem now of the kind that we had in the case of the range. So, in this regard, the quartile deviation is superior to the range as a measure of dispersion. Let us now try to interpret the quartile deviation in the same way as I did for the range a little while ago. As you now see on the screen, the quartile deviation can also be viewed as the arithmetic mean of the deviations of the first and third quartiles around the median, that is, the quartile deviation is equal to the sum of (M - Q1) and (Q3 - M), divided by 2, where M represents the median. Solving this expression, it comes out to be equal to (Q3 - Q1) over 2, exactly the same formula that we have for the quartile deviation in its basic definition. So it is the same idea as before: there is the median, there is the first quartile and there is the third quartile. There is the deviation, that is the distance, between the median and the first quartile, and then on the other side the distance between the median and the third quartile. We simply take the arithmetic mean of these two deviations, and this gives us a way of measuring the dispersion of our data set.
From this discussion it will be clear to you that we use the quartile deviation as a measure of dispersion in those situations in which we use the median as the most suitable measure of central tendency. Students, the quartile deviation possesses quite an attractive property, and that is that the median plus or minus the quartile deviation contains approximately 50 percent of the data. As you now see on the screen, the median of course lies in the middle, the median minus the quartile deviation will be a point to the left side, and the median plus the quartile deviation is obviously to the right. The property that I have just conveyed to you says that 50 percent, or approximately 50 percent, of the data will lie between these two points, median minus quartile deviation and median plus quartile deviation. So, I will give you a question. Suppose we have two distributions. Both are symmetric and both have the same median, but the quartile deviation of one is double the quartile deviation of the other. In such a situation, if you draw the graphs of both distributions on the same graph paper, what will you have? Obviously, the one which has double the quartile deviation will be much wider than the one which has the smaller quartile deviation. Let us now apply this concept of quartile deviation to an example. Suppose that the shareholding structure of two companies is as given in the table that you see on the screen. For company X, the first quartile is 60 shares, the median is 185 shares and the third quartile is 270 shares. On the other hand, for company Y, although the median is exactly the same as that for company X, the first quartile and the third quartile are quite different. The first quartile is 165 shares and the third quartile is 210 shares. The quartile deviation for company X therefore comes out to be (270 minus 60) over 2, and that is equal to 105 shares. On the other hand, for company Y, it is (210 minus 165) over 2, that is only 22.5, or roughly 22, shares.
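The shareholding computation can be checked with a short Python sketch. It uses only the quartiles and median quoted for companies X and Y; the function names are illustrative. It also confirms that averaging the two deviations around the median gives the same value as (Q3 - Q1) over 2. Note that the exact figure for company Y is 22.5 shares, which the lecture quotes as roughly 22.

```python
def quartile_deviation(q1, q3):
    """Semi-interquartile range: (Q3 - Q1) / 2."""
    return (q3 - q1) / 2


def quartile_deviation_about_median(q1, median, q3):
    """Arithmetic mean of the deviations of Q1 and Q3 around the median."""
    return ((median - q1) + (q3 - median)) / 2


# Quartiles (in shares) as quoted in the lecture's table.
company_x = {"q1": 60, "median": 185, "q3": 270}
company_y = {"q1": 165, "median": 185, "q3": 210}

qd_x = quartile_deviation(company_x["q1"], company_x["q3"])
qd_y = quartile_deviation(company_y["q1"], company_y["q3"])

# Same quantities, computed as mean deviations of the quartiles from the median.
qd_x_alt = quartile_deviation_about_median(**company_x)
qd_y_alt = quartile_deviation_about_median(**company_y)

print(qd_x, qd_y, qd_x_alt, qd_y_alt)
```

Both routes give 105 shares for company X and 22.5 shares for company Y, so the smaller value for Y reflects the tighter concentration around the common median of 185.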
Students, from these computations, I hope that it is obvious to you that there is considerable concentration of the shareholders around the median number of shares in company Y. In company X, on the other hand, we do not find this kind of concentration. There is approximately the same number of small, medium and large shareholders. The reason is very simple. Just look at the quartile deviations for the two companies again. For company X, the quartile deviation is 105 shares, but for company Y, the quartile deviation is only 22.5, or roughly 22, shares. The larger the quartile deviation, the greater is the scatter of the values within the series, and hence it is obvious that there is greater concentration around the median in company Y as compared with company X. Now, the question is: how do we compute the quartile deviation in the case of raw data? For this purpose, let us go back to the example of the US zoological parks that we considered in lecture number 7. As you will recall, the example was as follows: displayed in the following table are the annual attendance figures, in millions of visitors, of 32 US public zoological parks. The figures are 0.6, 0.9, 0.2 and so on, and all these figures are in millions. Now, we would like to compute the median, the first and third quartiles and the semi-interquartile range for this data. Of course, you all know that the semi-interquartile range is the same thing as the quartile deviation. In order to determine the values of the median and the upper and lower quartiles, students, we must first arrange the attendance figures in ascending order. As you now see on the slide, the arranged figures are 0.3, 0.4, 0.5 and so on. Now, because we have an even number of values, that is 32, the median is given by the average of the two middle values, which are 1.0 and 1.1.
Therefore, the median which is also known as the second quartile is equal to 1.0 plus 1.1 and this sum divided by 2 and therefore, we obtain 1.05. As far as the quartiles are concerned students, I would like you to first note the following general rule. In order to compute pj, the jth percentile from a set of n observations arranged in order from smallest to largest, we need to proceed as follows. Number 1, when jn over 100 is an integer, the jth percentile is given by the average of the jn by 100 and the jn by 100 plus 1th observations. Now, before I go to the second part of this rule, students, let us apply this first part to this particular example because as you will just see, this first part is applicable in this particular example. We wanted to find the first quartile q1 and q1 is the same thing as p25. Therefore, according to the rule, we are talking about j equal to 25 and therefore, jn over 100 is equal to 25n over 100 and because n is equal to 32, therefore, 25 into 32 divided by 100 comes out to be equal to 8 and 8 is an integer. Therefore, according to the rule that I just mentioned, we have to compute the average of the jn by 100 that is the 8th and the jn by 100 plus 1th that is the 9th observation and students, the 8th observation in our ordered dataset was 0.6 whereas, the 9th observation was 0.7 and therefore, the average of the 2 comes out to be 0.65. So, this is the first quartile or in other words, the 25th percentile of this particular dataset. Similarly, in order to compute q3, we note that q3 is the same thing as p75 and therefore, we are talking about j is equal to 75 and hence, jn over 100 comes out to be 24 and that is also an integer. Therefore, once again we apply the same rule and we need to find the average of the jn by 100 in other words, the 24th and jn by 100 plus 1th in other words, the 25th observation. 
So, students, in our ordered dataset the 24th value is 1.4 and the 25th value is 1.5, and therefore, finding the average of the two, our third quartile comes out to be 1.45. Now, the interquartile range is given by Q3 minus Q1, and therefore, subtracting 0.65 from 1.45, we obtain 0.8 million, or in other words 800,000, as the interquartile range. Now, students, what is the interpretation of this result that we have just obtained? It means that the middle 50 percent of the attendance figures span a range of 0.8 million. This value is displayed along with the quartiles and the attendance data in the slide that you now have on the screen. The idea is that the interquartile range covers the middle 50 percent of the values. But students, you remember that we were interested in the quartile deviation, which is none other than the semi-interquartile range. Therefore, all we have to do now is to divide what we just obtained, 0.8, by 2, and doing so we obtain the quartile deviation equal to 0.4 million. Now, before I end this discussion, students, I would like you to go back to the general rule that I started from regarding the computation of the jth percentile from a set of n observations arranged in order from smallest to largest. In this regard, the point to be noted, and this is point number 2 as against point number 1 that I stated earlier, is that if jn over 100 is not an integer, then the jth percentile is given by the value of the ([jn/100] + 1)th observation. But students, please note that I have a special type of bracket around jn over 100 in this particular expression, and an expression with this kind of bracket stands for the largest integer in jn over 100, that is, its integer part. Now, students, let me explain this with the help of a simple example, using the same example that we were just considering.
Suppose that we are interested in computing the 7th percentile for the data of the 32 zoological parks. Then j will be equal to 7, and we will have jn over 100 equal to 7 times 32 divided by 100, and this is equal to 2.24. Obviously, 2.24 is not an integer, and according to the rule that I just mentioned, the 7th percentile will be given by the (2 + 1)th, that is the third, observation. Why is this the case? Because, students, the largest integer in the number 2.24 is 2. So, we consider this largest integer and we add 1 to it, and that gives us 3, and we say that the 7th percentile is the value of the third observation. So, students, this is the way in which we compute the various percentiles, quartiles and deciles for raw data. Students, the quartile deviation is an absolute measure of dispersion. An absolute measure is one which is expressed in exactly the same units as the units of our data set. What is the relative measure of dispersion relevant to the quartile deviation? Students, it is known as the coefficient of quartile deviation, and it is expressed as the quartile deviation divided by the mid-quartile range. As you will recall, in the last lecture the mid-quartile range was defined as (Q3 + Q1) over 2, and if I divide the quartile deviation, that is (Q3 - Q1) over 2, by the mid-quartile range, that is (Q3 + Q1) over 2, then solving this expression we obtain (Q3 - Q1) over (Q3 + Q1). Once again this is a pure number, and it can be used to compare the dispersion of two or more sets of data. Students, the quartile deviation is definitely an improvement on the range, as I explained a short while ago. You simply pick up the first quartile and the third quartile, you find their difference and you take half of it. The next two measures of dispersion that I am going to discuss with you are the mean deviation and the standard deviation. These are two measures which are based on each and every observation in our data set.
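The two-part percentile rule can be sketched in Python as follows. Because the full attendance table is not reproduced in this transcript, the sketch uses the position numbers 1 to 32 as stand-in values (a hypothetical data set, chosen so that each result shows which observations the rule picks). The function name and the restriction to whole-number j between 1 and 99 are assumptions of this sketch. The last few lines use the quartiles 0.65 and 1.45 obtained in the lecture.

```python
def percentile(ordered, j):
    """jth percentile of values already sorted in ascending order.

    Follows the rule stated in the lecture:
      1. if jn/100 is an integer k, average the kth and (k+1)th observations;
      2. otherwise take the ([jn/100] + 1)th observation,
         where [ ] denotes the integer part.
    """
    n = len(ordered)
    if n == 0 or not (isinstance(j, int) and 0 < j < 100):
        raise ValueError("need non-empty data and an integer j with 0 < j < 100")
    k, remainder = divmod(j * n, 100)   # integer arithmetic avoids rounding issues
    if remainder == 0:
        # Observations are 1-indexed in the rule, 0-indexed in the list.
        return (ordered[k - 1] + ordered[k]) / 2
    return ordered[k]


# Stand-in data: the positions 1..32 themselves (NOT the real attendance figures).
positions = list(range(1, 33))

p25 = percentile(positions, 25)   # jn/100 = 8    -> average of 8th and 9th
p75 = percentile(positions, 75)   # jn/100 = 24   -> average of 24th and 25th
p7 = percentile(positions, 7)     # jn/100 = 2.24 -> 3rd observation

# Quartiles actually obtained in the lecture for the zoological parks (millions).
q1, q3 = 0.65, 1.45
interquartile_range = q3 - q1
quartile_dev = interquartile_range / 2
coefficient_qd = (q3 - q1) / (q3 + q1)

print(p25, p75, p7)
print(round(interquartile_range, 2), round(quartile_dev, 2), round(coefficient_qd, 3))
```

With the stand-in values, the results 8.5, 24.5 and 3 point at the 8th and 9th, the 24th and 25th, and the 3rd observations respectively, which matches the positions used in the lecture.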
In the case of the range we took the mid-range, and in the case of the quartile deviation the median, as the central value around which we were measuring the dispersion. Now the question is: how will we measure the dispersion of the values around the arithmetic mean? You know that the most frequently used average is the arithmetic mean. So, it is obvious that our first approach here will be very similar to what we did just now in the case of the range and in the case of the quartile deviation. And that is exactly what we will do now: we take the arithmetic mean as the central point, and we measure the distance of each and every data value from the mean. Denoting these distances by small d and considering their absolute values, the formula for the mean deviation becomes: mean deviation is equal to sigma |d| over n, that is, the sum of the absolute deviations divided by n. Now, in this formula we have to note two or three things. The first thing you will have noted is that I have considered absolute values. The reason for this is that if I just talk about deviations without considering the absolute values, then some of the deviations will be positive and some will be negative, and if I sum them my answer will be 0. But when we take absolute values, the negative deviations become positive, and when we sum them we obtain a positive answer; dividing that by the number of deviations, we obtain the average deviation, and this measure is called the mean deviation. After all, mean is just another word for average. So, we have to take the deviations of all the values from the mean, we have to take the absolute value of every deviation, and after that we have to average these absolute deviations. We will take up a detailed discussion of this formula next time, and after having discussed it in the case of raw data we will proceed to the case of grouped data, that is, when we have a frequency distribution. After that we will proceed to the most important measure of dispersion, and that is the standard deviation.
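As a small preview of the mean deviation formula for raw data, the following Python sketch applies sigma |d| over n to the marks from the difficult-test example earlier in this lecture (2, 3, 7, 4, 5 and 18). The function name is an illustrative choice, and the sketch also shows why the absolute values are needed: the signed deviations from the mean sum to zero.

```python
def mean_deviation(values):
    """Mean deviation about the arithmetic mean: sum(|x - mean|) / n."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n


# Marks out of 20 from the earlier test example.
marks = [2, 3, 7, 4, 5, 18]
mean = sum(marks) / len(marks)

# Signed deviations d = x - mean cancel out when summed.
signed_sum = sum(x - mean for x in marks)

# Absolute deviations do not cancel, so their average measures the spread.
md = mean_deviation(marks)

print(mean, signed_sum, md)
```

Here the mean is 6.5, the signed deviations add up to 0, and the mean deviation is 4 marks; unlike the range, this figure uses every observation in the data set.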
In the meantime I would like you to practice the concept of the range, the coefficient of dispersion, the quartile deviation and the coefficient of quartile deviation. My best wishes to you and until next time Allah Hafiz