Assalamu alaikum. Welcome to lecture number 8 of the course on statistics and probability. Students, you will recall that in the last lecture, I discussed with you the concept of central tendency. In particular, we discussed the arithmetic mean, the weighted mean, and the median in the case of raw data and in the case of the frequency distribution of a discrete variable. Today, I will continue with the concept of the median and discuss its computation in the case of the frequency distribution of a continuous variable. Later, we will go on to some other measures of central tendency. In the case of the frequency distribution of a continuous variable, the formula for the median is l plus h over f, multiplied by n over 2 minus c. In this formula we have l, h, f, n and c. The first thing to note is that the first step is to compute the value n over 2, that is, the total number of observations divided by 2. Next, we construct the column of cumulative frequencies and locate this number n over 2 in that column. Let us take up the example of the EPA mileage ratings. As you now see on the screen, the column of cumulative frequencies for that example is 2, 6, 20, 28 and 30. Dividing 30 by 2, we obtain the number 15, and if we locate this number in the column of cumulative frequencies, it is obvious that the 15th value lies in the third class, whose cumulative frequency is 20, and not in the second class, whose cumulative frequency is only 6. This means that the median class is the third class, 36.0 to 38.9. Remember that the median is the central value: if you arrange the 30 cars according to their mileage, the value that we require is the one around the 15th or the 16th position.
So, when we divided 30 by 2, we got the number 15, which is the position of the mileage that we want. We located it in the column of cumulative frequencies to see which mileage group it falls in, we found that it is the third group, and therefore the median lies in this third group. Now that we have located the median class, students, let us go back to the formula that I stated a few minutes ago: x tilde equals l plus h over f, multiplied by n over 2 minus c. In this formula, l is the lower boundary of the median class, h is the class interval of the median class, f is the frequency of the median class, n over 2 is 15 as explained earlier, and c represents the cumulative frequency of the class immediately preceding the median class. Locating all these numbers in the example of the EPA mileage ratings, we find that l is 35.95, h is 3, f is 14 and c is 6, and substituting these numbers in the formula, x tilde, that is, the median, comes out to be 37.88 miles per gallon. The meaning of this is that 15 cars have mileage less than or up to 37.88 miles per gallon, and 15 cars have mileage more than this value. All right, let us now apply this concept to the example of the managers of child care centers that we discussed in the last lecture. You will recall that the statement of the example was: the following table contains the ages of 50 managers of child care centers in 5 cities of a developed country. The ages are 42, 26, 32 and so on. Having converted this data into a frequency distribution, we would like to find the median age. All right, students, you will recall that, following the various steps involved in the construction of a frequency distribution, we obtained the class intervals 20 to 29, 30 to 39, 40 to 49 and so on, and the frequencies were 6, 18, 11, 11, 3 and 1.
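The steps just described for the EPA mileage ratings can be sketched in Python as follows. This is only an illustrative sketch: the function name `grouped_median` is hypothetical, and the class boundaries 29.95, 32.95 and so on are inferred from the stated class limits (30.0 to 32.9, 33.0 to 35.9, ...) and from the value l = 35.95 quoted for the median class.

```python
from itertools import accumulate

def grouped_median(lower_boundaries, width, freqs):
    """Median of a grouped continuous distribution: l + (h/f) * (n/2 - c)."""
    cum = list(accumulate(freqs))      # column of cumulative frequencies
    n = cum[-1]                        # total number of observations
    half = n / 2
    # Median class: the first class whose cumulative frequency reaches n/2.
    i = next(k for k, cf in enumerate(cum) if cf >= half)
    l = lower_boundaries[i]            # lower boundary of the median class
    f = freqs[i]                       # frequency of the median class
    c = cum[i - 1] if i > 0 else 0     # cumulative frequency of the preceding class
    return l + (width / f) * (half - c)

# EPA mileage ratings (boundaries inferred from the class limits, see above).
epa_lower = [29.95, 32.95, 35.95, 38.95, 41.95]
epa_freqs = [2, 4, 14, 8, 2]

epa_median = grouped_median(epa_lower, 3, epa_freqs)
print(round(epa_median, 2))  # 37.88
```

The cumulative frequencies built inside the function are 2, 6, 20, 28 and 30, the same column that appears on the screen, so the third class is picked as the median class.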
Now, the median is given by x tilde is equal to l plus h over f, multiplied by n by 2 minus c, where l is the lower class boundary of the median class, h is the class interval of the median class, f is the frequency of the median class and n is the total number of observations. c stands for the cumulative frequency of the class preceding the median class. So, first of all, we construct the column of class boundaries as well as the column of cumulative frequencies. As you now see on the screen, the class boundaries are 19.5 to 29.5 for the first class, then 29.5 to 39.5, 39.5 to 49.5 and so on. And if we look at the fourth and last column of the table, we find that the first cumulative frequency is 6, exactly the same as the frequency of the first class, and then 6 plus 18 gives us 24, 24 plus 11 gives us 35 and so on. Now, first of all we have to determine the median class, that is, the class for which the cumulative frequency is just in excess of n by 2. In this example, n is equal to 50, implying that n by 2 is equal to 25. Therefore, as we can see in the table, the third class of our frequency distribution is the median class. Having determined the median class, we see that l is equal to 39.5, h is equal to 10, f is equal to 11 and c is equal to 24. Substituting these values in the formula, we obtain x tilde is equal to 40.4. Thus, we conclude that the median age is 40.4 years. In other words, 50 percent of the managers are younger than this age and 50 percent are older. As I conveyed to you in the last lecture, students, the median is that average which is much preferable to the arithmetic mean in the situation when our data set contains a few very high or very low values. Of course, this is not the case in the example that we just discussed, but generally this point should be kept in mind.
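Written out as a worked equation, the substitution for the managers' ages looks like this (the symbols are the ones used in the lecture; only the typeset form is added here):

```latex
\tilde{x} \;=\; l + \frac{h}{f}\left(\frac{n}{2} - c\right)
         \;=\; 39.5 + \frac{10}{11}\,(25 - 24)
         \;=\; 39.5 + 0.909\ldots
         \;\approx\; 40.4 \text{ years}
```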
The median is also a very useful measure in the particular situation where our frequency distribution is an open-ended frequency distribution. Open-ended distributions are very common. Let me explain this to you with the help of an example. Suppose that the wages of the workers in a factory are as you now see on the screen. 100 workers have monthly income less than rupees 2000, 300 workers have income between rupees 2000 and 2999, 500 have income between rupees 3000 and 3999 and so on. Notice that for the last class, we have the information that 50 workers have income which is rupees 5000 and above. In this example, the first class and the last class are both what are called open-ended classes. The reason is that we do not have the lower limit of the first class and we do not have the upper limit of the last class. When we say that they have income of rupees 5000 or higher, we do not know what exactly we mean by higher. Is it up to rupees 6000, 7000, 10000? That is unknown. Similarly, for the first class, if we say it is less than 2000, we do not know where we are starting from. Hence, in the case of this type of distribution, if you want to compute the arithmetic mean, you will have a bit of a problem. The reason is that, as you remember, the formula for the arithmetic mean is x bar is equal to sigma fx over sigma f, where x represents the mid-points of the various classes. When you do not have the lower limit of the first class, you cannot compute its mid-point. You can only imagine that the lower limit might be this much and that accordingly the mid-point of that class might be a certain value. Similarly, for the last class, because you do not know the upper limit, you can at best estimate it, and hence you will have only an estimated value for the mid-point of the last class as well.
The median, as I said, is given by l plus h over f, multiplied by n by 2 minus c, and l, h and f pertain to the median class, which would generally lie somewhere in the middle of your frequency distribution. The open-ended class at the beginning and the open-ended class at the end are not involved at all, and hence there is no problem in computing the median. Now that we have discussed the arithmetic mean, the median and the mode, which we did in the last lecture or the one before, students, the next concept that I am going to discuss with you is the empirical relation between the mean, the median and the mode. The word empirical means something that is based on observation. That is to say, the relation that I am going to discuss with you does not have any rigid mathematical formula; rather, it is something that has been observed when dealing with real data sets. Students, this concept relates to the relative positions of the median, the mean and the mode on the x axis of our frequency distribution. You will recall that I discussed with you the absolutely symmetrical distribution, the moderately skewed distribution, positive and negative skewness, and all these concepts. First of all, let us talk about the absolutely symmetrical distribution. Students, in this distribution the median, the mode and the mean all lie at exactly the middle of your distribution. In other words, they coincide at the point which is at the exact center of your distribution. The point that I have just discussed is now on the screen in front of you, and as you can see, the mean, the median and the mode all lie at the same point, and that point is at the exact center of the distribution. But students, in the case of a skewed distribution, these three values do not lie at the same point; rather, they are pulled apart. And they are pulled apart in a certain way, which I am now going to explain to you.
As you now see on the screen, in the case of a positively skewed distribution the mean is greater than the median and the median is greater than the mode, and a very important point to note is that the distance between the median and the mode is approximately double the distance between the median and the mean. We can express this algebraically as follows: the median minus the mode is approximately two times the mean minus the median. Now, if we solve equation one or if we solve equation two, students, in both situations we obtain the approximate relationship that you now see on the screen. The empirical relation between the mean, the median and the mode comes out to be: the mode is approximately equal to three times the median minus two times the mean. Students, all of this discussion that I have just had with you was with reference to the moderately positively skewed distribution. But all of it is valid in the case of a negatively skewed distribution as well. The only difference is that the pattern you just saw for the positively skewed distribution, namely that the mean was bigger than the median and the median was bigger than the mode, will be reversed: the mean will be less than the median and the median will be less than the mode. I would like you to take it up as an exercise to study this yourself for the negatively skewed distribution and to see that you obtain exactly the same empirical relation that I have just discussed for the positively skewed case. A very important point to note is that this relation does not hold in the case of an extremely positively or negatively skewed distribution, in other words, the J-shaped or the reverse J-shaped distribution. It does hold in the case of the moderately skewed distribution and, as I have told you several times before, the moderately skewed distribution is the one that you most frequently encounter with real-life data sets.
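The step from the statement about distances to the empirical relation can be written out as below. This is just the algebra for the relation as stated verbally in the lecture; it is not a reproduction of the equations shown on the screen.

```latex
\begin{align*}
\text{median} - \text{mode} &\approx 2\,(\text{mean} - \text{median}) \\
-\,\text{mode} &\approx 2\,\text{mean} - 2\,\text{median} - \text{median} \\
\text{mode} &\approx 3\,\text{median} - 2\,\text{mean}
\end{align*}
```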
Let us try to verify this relation for the data of the EPA mileage ratings that we have been considering for the past few lectures. Students, you will recall that in the frequency distribution for that example the class limits were 30.0 to 32.9, 33.0 to 35.9 and so on, and the frequencies were 2, 4, 14, 8 and 2. Also, the histogram of the frequency distribution was as you now see on the screen, and the frequency polygon and the frequency curve were as you now see. Now, students, it is clear from all these diagrams that this particular frequency distribution is only slightly skewed. As I mentioned earlier, the empirical relation between the mean, the median and the mode holds for moderately skewed distributions and not for extremely skewed ones. Hence, in this particular example, since the distribution is only very slightly skewed, we can expect the empirical relation to hold reasonably well. Students, you will recall that in this particular example, the arithmetic mean was 37.85, the median was 37.88 and the mode came out to be 37.825. Now, the close proximity of these three measures of central tendency provides a strong indication of the fact that this particular frequency distribution is indeed very slightly skewed. Now, the empirical relation between the mean, the median and the mode is given by: the mode is approximately equal to 3 times the median minus 2 times the mean. Substituting the values of the median and the mean in the right-hand side of this relation, we obtain 3 times 37.88 minus 2 times 37.85, which is equal to 37.94. Now, students, the mode is equal to 37.825 and we notice that it is indeed very close to 37.94, the value that we just obtained for the right-hand side of this relation. Hence, the empirical relation is verified.
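The check above is a single line of arithmetic, and it can be reproduced with a small Python sketch. The variable names are illustrative; the three values are the ones quoted in the lecture for the EPA example.

```python
# Values quoted in the lecture for the EPA mileage ratings.
mean, median, mode = 37.85, 37.88, 37.825

# Right-hand side of the empirical relation: mode ~ 3 * median - 2 * mean.
estimated_mode = 3 * median - 2 * mean
gap = abs(estimated_mode - mode)

print(round(estimated_mode, 2))  # 37.94
print(gap < 0.2)                 # the estimate is within about 0.1 mpg of the mode
```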
Let me now extend this concept of partitioning. In the case of the median, we partitioned our distribution into two equal parts. Let me extend this now to the partitioning of the distribution into four equal parts, ten equal parts or a hundred equal parts. In other words, I will now be talking with you about quartiles, deciles and percentiles. Quartiles are values that divide the data set into four equal parts. They are denoted by Q1, Q2 and Q3. You will agree with me that if you want to divide a data set into four parts, you will require three quantities. This is similar to what we had in the case of the median: we wanted to divide the data set into two equal parts, and we needed just one quantity for that purpose, and that was the median. Let me now give you the formulae of the three quartiles. As you now see on the screen, the first quartile is given by l plus h over f, multiplied by n by 4 minus c. The second quartile is given by l plus h over f, multiplied by 2n over 4 minus c, and the third quartile by l plus h over f, multiplied by 3n by 4 minus c. Students, I would like you to note three things in this regard. The first point is that these formulae are valid in the case of the frequency distribution of a continuous variable. The second is that I hope you have been able to detect a certain pattern in the three formulae that I have presented to you. You have n over 4 in the Q1 formula. In the Q2 formula, it is 2n over 4, and in the Q3 formula, it is 3n over 4. What is the reason for this? The reason is that the first quartile is that value before which you have 25 percent, that is, one fourth, of the observations. The second quartile is that value before which you have 50 percent, that is, n by 2 or 2n by 4, of the n observations, and the third quartile is that value before which you have 75 percent, that is, three fourths, of the observations.
The third and last point that you must have noted is that the formula of the second quartile is exactly equivalent to the formula of the median. After all, what is the median? It too is that value before which you have 50 percent of the observations. The relative positions of the three quartiles are as you now see on the screen. The first quartile has 25 percent of the values to its left and 75 percent to its right. The second quartile, that is, the median, has 50 percent to its left and 50 percent to its right, and the third quartile Q3 has 75 percent to its left and 25 percent to its right. So much for the quartiles. The logic for the deciles and the percentiles is exactly the same. The deciles are those nine quantities that divide our distribution into ten equal parts, and the percentiles are those 99 quantities that divide our distribution into 100 equal parts. Now, what will the formulae for the deciles look like in the case of the frequency distribution of a continuous variable? Exactly like the ones I gave you for the quartiles. The formula for the first decile is D1 is equal to l plus h over f, multiplied by n by 10 minus c. The second decile is l plus h over f, multiplied by 2n by 10 minus c. The third decile is l plus h over f, multiplied by 3n by 10 minus c, and so on. I hope that you will be able to judge easily that the fifth decile is exactly the same thing as the median, because 5n over 10 is exactly the same as n over 2. The situation is exactly the same for the percentiles. The formula for the first percentile is l plus h over f, multiplied by n by 100 minus c. The second percentile is l plus h over f, multiplied by 2n over 100 minus c, and so on. And in this situation, the 50th percentile is exactly the same thing as the median. The 25th percentile is exactly the same thing as the first quartile, and the 75th percentile is none other than the third quartile. I hope that you will be able to establish all these points very clearly in your mind.
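The pattern in these formulae can be collected in one place. In the sketch below, the subscript k is a notational assumption added here; l, h, f and c always refer to the class that contains the quantile being computed, in the same way that they refer to the median class in the median formula.

```latex
\begin{align*}
Q_k &= l + \frac{h}{f}\left(\frac{kn}{4} - c\right),   & k &= 1, 2, 3 \\
D_k &= l + \frac{h}{f}\left(\frac{kn}{10} - c\right),  & k &= 1, \dots, 9 \\
P_k &= l + \frac{h}{f}\left(\frac{kn}{100} - c\right), & k &= 1, \dots, 99
\end{align*}

\[
Q_2 = D_5 = P_{50} = \tilde{x}, \qquad Q_1 = P_{25}, \qquad Q_3 = P_{75}
\]
```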
And if you do a little bit of working, you will see that you will not face any difficulty at all in this regard. Students, I would like you to note the difference between the word quartile and the word quantile. Quartiles, deciles and percentiles are collectively called quantiles. Also, these quantities are called fractiles because they divide our distribution into various parts or fractions. Now, students, let me illustrate the computation of the quantiles with the help of the example of the ages of the managers of the child care centers. You will recall that the classes of the frequency distribution for this example were 20 to 29, 30 to 39, 40 to 49 and so on, and the frequencies were 6, 18, 11 and so on. Suppose that we wish to determine the first quartile, the sixth decile and the 17th percentile. We begin with the first quartile, which is also known as the lower quartile. It is given by Q1 is equal to l plus h over f, multiplied by n by 4 minus c. First of all, we find n by 4, and in this example it is equal to 12.5. Now, the cumulative frequency of the first class is 6, whereas the cumulative frequency of the second class is 24. Since 12.5 lies between 6 and 24, it is obvious that the first quartile lies in the second class. Hence, l is equal to 29.5, h is equal to 10, f is equal to 18 and c is equal to 6. Therefore, the first quartile is 29.5 plus 10 over 18, multiplied by 12.5 minus 6, and that is equal to 33.1. So, the interpretation is that one fourth of the managers are younger than 33.1 years and three fourths are older than this age. All right, students, next we compute the sixth decile, which is given by l plus h over f, multiplied by 6n over 10 minus c. So, first of all, we compute 6n over 10, and that comes out to be 6 into 50 over 10, that is, 30. Now, the cumulative frequency of the second class is 24, whereas the cumulative frequency of the third class is 35.
Our number 30 lies between 24 and 35, and hence it should be obvious that the sixth decile lies in the third class of our frequency distribution. Hence, l is equal to 39.5, the class interval h is equal to 10, the frequency of that particular class is 11, and c, the cumulative frequency of the class preceding that particular class, is 24. Substituting all these values in the formula, the sixth decile comes out to be 44.95. This means that six tenths, or in other words 60 percent, of the managers are younger than 44.95 years and four tenths are older. Last but not least, we compute the 17th percentile, which is given by the formula l plus h over f, multiplied by 17n over 100 minus c. Computing 17n over 100, we obtain 8.5. Now, since the cumulative frequency of the first class is 6 and the cumulative frequency of the second class is 24, it is clear that this particular value, the 17th percentile, lies in the second class of our frequency distribution. Hence, l is equal to 29.5, h is 10, f is 18 and c is equal to 6, and substituting these values in the formula, the 17th percentile comes out to be 30.9. Similar to the previous interpretations, the interpretation of this result is that 17 percent of the managers are younger than 30.9 years and 83 percent are older than this particular age. Students, what is the significance of this concept of partitioning? Why is it that we want to divide our distribution into all these different parts? The answer to this question is that in many situations we are interested in the relative quantitative location of our measurement. Quantiles provide us with an easy way of achieving this. Let me explain this to you with the help of an example. If oil company A reports that its yearly sales are at the 90th percentile of all the companies in that particular industry, the implication is that 90 percent of all the oil companies have yearly sales less than company A's and only 10 percent have sales exceeding those of company A.
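The three computations for the managers' ages follow one recipe, which can be sketched in Python as below. The function name `grouped_quantile` and its parameter names are hypothetical, and the lower boundaries beyond 39.5 are an assumption that simply continues the equal-width pattern 19.5, 29.5, 39.5 given in the lecture (only the first three classes are actually used here).

```python
from itertools import accumulate

def grouped_quantile(k, parts, lower_boundaries, width, freqs):
    """k-th quantile when the distribution is split into `parts` equal parts:
    l + (h/f) * (k*n/parts - c)."""
    cum = list(accumulate(freqs))          # cumulative frequencies
    n = cum[-1]
    target = k * n / parts                 # n/4, 6n/10, 17n/100, ...
    # Class containing the quantile: first cumulative frequency reaching target.
    i = next(j for j, cf in enumerate(cum) if cf >= target)
    c = cum[i - 1] if i > 0 else 0         # cumulative frequency of preceding class
    return lower_boundaries[i] + (width / freqs[i]) * (target - c)

# Ages of the 50 managers; boundaries past 39.5 are assumed from the pattern.
age_lower = [19.5, 29.5, 39.5, 49.5, 59.5, 69.5]
age_freqs = [6, 18, 11, 11, 3, 1]

q1 = grouped_quantile(1, 4, age_lower, 10, age_freqs)      # first quartile
d6 = grouped_quantile(6, 10, age_lower, 10, age_freqs)     # sixth decile
p17 = grouped_quantile(17, 100, age_lower, 10, age_freqs)  # 17th percentile

print(round(q1, 2), round(d6, 2), round(p17, 2))  # 33.11 44.95 30.89
```

Rounded to one decimal place, the first and last results agree with the 33.1 and 30.9 quoted in the lecture.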
If we mark company A's sales as a point on the x axis, the area under the curve to the left of that point is 90 percent of the total area and the area to the right is 10 percent. Here is another interesting example of this concept. Suppose that you sit for a particular exam and you want to know where you stand with reference to the rest of the class. If your marks are above the 90th percentile, this means that you are among the top 10 percent of your class, and that is great. The next concept that I would like to discuss with you is the graphic location of the quantiles. Let me explain this point with reference to the same example that we are very fond of, that of the EPA mileage ratings of cars. As you will recall, the statement of that example was: suppose that the Environmental Protection Agency of a developed country performs extensive tests on all new car models in order to determine their mileage rating. Suppose that the following 30 measurements are obtained by conducting such tests on a particular new car model: 36.3, 30.1, 40.5 and so on. Also, you will recall that when we converted this raw data into a frequency distribution, we obtained the classes 30.0 to 32.9, 33.0 to 35.9 and so on, and the frequencies were 2, 4, 14, 8 and 2. In addition, we discussed its graphical representation. We drew the histogram, the frequency polygon and the frequency curve, and also, if you recall, we constructed the cumulative frequency polygon, that is, the ogive, as you now see on the screen. Students, this cumulative frequency polygon will enable us to graphically locate the median, the quartiles, the deciles or any percentile that we may be interested in, and this is called the graphic location of quantiles. Now, because the median is that value before which half of the data set lies, the first step in this regard is to calculate the value n over 2. In this example, because n is equal to 30, n by 2 comes out to be 15.
The next step is to locate this number n by 2 on the y axis of the cumulative frequency polygon, as you now see on the screen. Next, we draw a horizontal line perpendicular to the y axis starting from the point n by 2, which in this example is 15, and extend this line up to the cumulative frequency polygon, as you now see on the screen. Lastly, we drop a vertical line from the cumulative frequency polygon down to the x axis. Now, if we read the x value where our perpendicular touches the x axis, students, we find that this value is approximately the same as what we obtained from our formula. You will remember that when we applied the formula x tilde is equal to l plus h over f, multiplied by n by 2 minus c, our answer came out to be 37.9, and the answer that you obtain from the cumulative frequency polygon is approximately the same. Students, you will have realized that the cumulative frequency polygon is a very useful tool for locating the median very quickly. In exactly the same way, we can locate the first quartile, the third quartile and so on. For the first quartile, our horizontal line perpendicular to the y axis will be drawn against the value n over 4, and for the third quartile it will be drawn against the value 3n over 4. Similarly, I am sure that you can now judge what value you should compute if you want to locate, for example, the 67th percentile. I am sure you said 67n over 100. And if you did not, I would encourage you to work on this a little and to try to understand the pattern that I explained earlier when I was explaining the formulae to you. All right, so we have discussed the graphic location of quantiles as well. Now, what remains in this connection? You will have noted that all the formulae that I gave you today for the quartiles, deciles and percentiles are the ones which are valid in the case of the frequency distribution of a continuous variable.
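Reading a value off the cumulative frequency polygon amounts to linear interpolation along the straight segment that the horizontal line meets. The Python sketch below imitates that reading numerically instead of drawing the graph. The function name `read_from_ogive` is hypothetical, and the boundaries 29.95, 32.95 and so on are inferred from the class limits and from l = 35.95; the ogive is assumed to start at cumulative frequency 0 at the first lower boundary.

```python
def read_from_ogive(target, boundaries, cum_freqs):
    """x value at which the cumulative frequency polygon reaches `target`."""
    points = list(zip(boundaries, cum_freqs))
    # Walk along the polygon segment by segment.
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 <= target <= y1 and y1 > y0:
            # Horizontal line at `target` meets this segment; drop to the x axis.
            return x0 + (x1 - x0) * (target - y0) / (y1 - y0)
    raise ValueError("target lies outside the range of the ogive")

# EPA mileage ratings: class boundaries and cumulative frequencies (starting at 0).
epa_boundaries = [29.95, 32.95, 35.95, 38.95, 41.95, 44.95]
epa_cum = [0, 2, 6, 20, 28, 30]
n = epa_cum[-1]

median_from_ogive = read_from_ogive(n / 2, epa_boundaries, epa_cum)
q1_from_ogive = read_from_ogive(n / 4, epa_boundaries, epa_cum)
q3_from_ogive = read_from_ogive(3 * n / 4, epa_boundaries, epa_cum)
p67_from_ogive = read_from_ogive(67 * n / 100, epa_boundaries, epa_cum)

print(round(median_from_ogive, 2))  # 37.88
```

Because the polygon is made of straight segments between class boundaries, this interpolation gives the same number as the formula l plus h over f, multiplied by n by 2 minus c; a reading taken by eye from a drawn graph will only be approximately equal to it.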
You may have wondered: what about the situation when we have a discrete variable, or what about the situation when we do not have a frequency distribution but simply have raw data? In such situations too, it is possible for us to compute the various quantiles. But the method will be different, and in this course we will not have enough room for me to discuss all these situations with you in detail in the lectures. But I would like to encourage you, students, to study the textbook and other books and to explore all the different variations of the formulae that I have explained to you, the variations which are valid in different situations. Also, I would like to encourage you to attempt quite a lot of questions from your exercises so that you have a lot of practice and you feel at home with this interesting idea of partitioning a data set into various parts. So, I wish you the best of luck and until next time, Allah Hafiz.