As-salamu alaikum, and welcome to lecture number 12 of the course on statistics and probability. You will recall that in the last lecture I was discussing with you the concept of dispersion. In particular, I discussed the mean deviation and the standard deviation, and towards the end of the lecture we discussed the relative measure called the coefficient of variation. Today I will continue with the concept of the standard deviation, and I will convey to you a very interesting and important inequality called Chebyshev's inequality. Students, this inequality and the empirical rule, another important rule that I will discuss, will help us understand the significance and the role of the standard deviation in properly interpreting the dispersion, or spread, of our data set. You will recall that last time we discussed in considerable detail the standard deviation and the coefficient of variation, and the role that the standard deviation plays when we want to compare two or more data sets. But, students, we have not focused as much on the question: how do we interpret the standard deviation in the case of one single data set? Chebyshev's inequality and the empirical rule are two different ways of answering this particular question. To understand this point properly, students, bring a particular data set to mind and try to answer the following questions. Question one: how many data values fall within one standard deviation of the mean? Question two: how many data values fall within two standard deviations of the mean? And question three: how many data values fall within three standard deviations of the mean? Students, why have I asked you these questions? You will remember that last time I conveyed to you in great detail the range, the quartile deviation, the mean deviation and the standard deviation.
All of these can be represented as a horizontal distance that we draw below the x-axis, a distance which enables us to understand the spread of our distribution. If you keep this point in mind, the answers to these questions are not difficult. The first question was: how many data values fall within one standard deviation of the mean? What does this mean? Very simply, I want you to think that there will be a certain number, and a certain proportion, of the data values lying between x bar minus s and x bar plus s. Bring to mind that same frequency distribution and its curve. X bar is in the middle, and x bar minus s is to the left, because when you subtract the distance s from x bar you travel backward by a distance of one standard deviation. So x bar minus s will lie not at the very beginning of your distribution but towards the left side, and x bar plus s towards the right side, and there is a certain amount of data that lies in this range. Similarly, a certain amount of data will lie between x bar minus 2s and x bar plus 2s, and similarly for x bar minus 3s to x bar plus 3s. Now the question arises: how do we determine how much data lies within these ranges? Well, it is obvious that if we have a particular data set, all we have to do is first compute x bar and s, and then find all these values: x bar minus s, x bar plus s, x bar minus 2s and so on. After that we simply count the number of values lying inside these ranges. But, students, the theorem that I want to convey to you, which is called Chebyshev's theorem or Chebyshev's inequality, provides a general answer to this question, and it is obvious that a general answer which is valid in every situation is much better.
So, as you now see on the screen, Chebyshev's theorem states that for any number k greater than 1, at least 1 minus 1 over k squared of the data values fall within k standard deviations of the mean, that is, within the interval x bar minus ks to x bar plus ks. Two or three things in this theorem are very important. First, note that I said k is a number greater than 1. Second, and very importantly, note that I said at least 1 minus 1 over k squared of the data values fall between x bar minus ks and x bar plus ks. These two words, "at least", are very important. Let us try to understand this theorem in a little more detail. As I said, k is a number greater than 1, so let me take the simplest case, k equal to 2. If you substitute k equal to 2 in the expression 1 minus 1 over k squared, you get 1 minus 1 over 2 squared, that is 1 minus 1 over 4, which is equal to 3 over 4. This means that for k equal to 2 we read Chebyshev's theorem as follows: at least three fourths of the data values fall between x bar minus 2s and x bar plus 2s. Similarly, if I put k equal to 3, 1 minus 1 over 3 squared is equal to 1 minus 1 over 9, and that is 8 over 9. So what I am saying is that for any data set, at least 8 over 9 of the values fall between x bar minus 3s and x bar plus 3s. Students, Chebyshev's theorem is valid for any data set, may it be a sample drawn from a population or an entire population. Its limitation is that it does not provide any useful information for the case k equal to 1, because if you put k equal to 1 in the expression 1 minus 1 over k squared, then 1 minus 1 over 1 squared is equal to 1 minus 1, and that is 0.
And we would be saying that at least 0 percent of the data lies between x bar minus s and x bar plus s, and this obviously tells us nothing. The values I have just calculated with you, 3 over 4 in the case when k is equal to 2 and 8 over 9 in the case when k is equal to 3, can obviously also be kept in mind in percentage form. It is easy to remember that at least 75 percent of the data values will always lie between x bar minus 2s and x bar plus 2s, and similarly at least 89 percent of the values will fall between x bar minus 3s and x bar plus 3s. The 89 percent that I mentioned is a rounded figure; 8 over 9 is actually a little less, about 88.9 percent. In the same way, students, you can put in any value of k greater than 1 and you will get the corresponding figure for the minimum amount of data that lies in that range. But there is a very interesting and important point at this stage: in many situations Chebyshev's inequality provides only weak information regarding the amount of data that falls within these ranges. For many data sets, the ones which lead to mound-shaped, approximately symmetrical distributions, the actual proportion of data values that lies between x bar minus 2s and x bar plus 2s, for example, is much greater than 75 percent, and similarly the actual proportion of data values that lies between x bar minus 3s and x bar plus 3s for a mound-shaped symmetric distribution is again much greater than 89 percent. So this means that there are definitely some situations in which Chebyshev's inequality, or Chebyshev's theorem, comes to our rescue, because the distributions of those data sets are not mound-shaped and symmetric, and in that case, through Chebyshev's inequality, we are immediately able to ascertain that at least this much data lies in this particular range.
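As a quick check of the arithmetic above, the following minimal Python sketch evaluates the expression 1 minus 1 over k squared for a few values of k. The function name `chebyshev_bound` is a hypothetical helper introduced only for illustration; it is not something shown in the lecture.

```python
from fractions import Fraction


def chebyshev_bound(k):
    """Minimum proportion of data within k standard deviations of the mean.

    Hypothetical helper: evaluates 1 - 1/k^2 exactly, as a fraction.
    """
    k = Fraction(k)
    return 1 - 1 / k**2


# k = 1 gives a bound of 0, which is why the theorem is stated for k > 1.
for k in (1, 2, 3, 5):
    bound = chebyshev_bound(k)
    print(f"k = {k}: at least {bound} = {float(bound):.1%} of the data")
```

Using exact fractions makes it easy to see that 8 over 9 is slightly below the rounded figure of 89 percent.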
But for the many data sets in which we have an approximately symmetric, hump-shaped distribution, we need to do better than Chebyshev's inequality, and in this regard I will now convey to you the other rule that I mentioned, the empirical rule. Students, the empirical rule is a kind of rule of thumb which is valid in the case of symmetric or approximately symmetric hump-shaped distributions, as I just mentioned. According to this rule, approximately 68 percent of the measurements will fall within one standard deviation of the mean, that is, within the interval x bar minus s to x bar plus s. Also, approximately 95 percent of the measurements will fall within two standard deviations of the mean, that is, within the interval x bar minus 2s to x bar plus 2s. Similarly, approximately 100 percent of the measurements, that is, practically all the measurements, fall within three standard deviations of the mean, that is, within the interval x bar minus 3s to x bar plus 3s. Because of this last point, that practically all the data lies between x bar minus 3s and x bar plus 3s, students, we obtain a very interesting relation between the range and the standard deviation of an approximately symmetric hump-shaped distribution, the case that we are discussing at this time. You will remember that we defined the range as the distance between x naught, the smallest value, and x m, the largest. Now if we are saying that almost all the data lies between x bar minus 3s and x bar plus 3s, then it follows that x naught corresponds approximately with x bar minus 3s and x m corresponds approximately with x bar plus 3s. This means that in going from x naught to x m we travel 3s to the left of x bar and 3s to the right of x bar, in other words, 6 standard deviations.
And so the relationship that I am referring to is that for a symmetric or approximately symmetric mound-shaped distribution, the range is approximately 6 times the standard deviation. Let us apply this empirical rule to an example. As you now see on the screen, suppose that we have data regarding the percentages of revenues spent on research and development, that is R&D, by 50 different companies, and suppose that the figures are 13.5 percent, 9.5 percent, 8.2, 6.5 and so on. Suppose that we wish to calculate the proportions of these measurements that lie within the intervals x bar plus or minus s, x bar plus or minus 2s and x bar plus or minus 3s. In order to do this, of course, the first step is to find the mean and the standard deviation of this data set. If we apply the formulae that I have discussed with you in the previous lectures, the mean of this data set comes out to be 8.49 and the standard deviation comes out to be 1.98. Accordingly, x bar minus s comes out to be 6.51 and x bar plus s is 10.47. Now if we count the number of data values that lie inside this interval, we find that 34 of the 50 measurements, in other words 68 percent of the measurements, fall between 6.51 and 10.47. Similarly, the interval from x bar minus 2s to x bar plus 2s comes out to be 4.53 to 12.45, and when we count the number of data values that fall in this range, we find that 47 of the 50 measurements, that is 94 percent of the data values, lie in this interval. Finally, the three standard deviation interval around x bar comes out to be 2.55 to 14.43, and we find that 100 percent of the values lie in this interval. Students, if you construct a frequency distribution from these values, draw the histogram and superimpose the frequency curve on it, you will see that this distribution is somewhat positively skewed, that is, it is not totally symmetric.
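The procedure just described (compute x bar and s, form the intervals, then count) can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, the standard deviation is taken in its population form via `statistics.pstdev` (the lecture's exact formula is not restated here), and because the full list of 50 R&D percentages is not given, the counting step is shown on a small made-up data set.

```python
from statistics import mean, pstdev


def k_sigma_interval(x_bar, s, k):
    """Return the interval (x_bar - k*s, x_bar + k*s)."""
    return (x_bar - k * s, x_bar + k * s)


def proportion_within(data, k):
    """Proportion of values within k standard deviations of the mean.

    Assumption: uses the population standard deviation (pstdev).
    """
    low, high = k_sigma_interval(mean(data), pstdev(data), k)
    return sum(low <= x <= high for x in data) / len(data)


# Intervals from the summary figures reported for the R&D example.
for k in (1, 2, 3):
    low, high = k_sigma_interval(8.49, 1.98, k)
    print(f"k = {k}: {low:.2f} to {high:.2f}")

# Hypothetical toy data (NOT the R&D figures): mean 5, standard deviation 2.
toy = [2, 4, 4, 4, 5, 5, 7, 9]
print(proportion_within(toy, 1), proportion_within(toy, 2))
```

With the actual 50 percentages in place of the toy list, `proportion_within` would reproduce the counting step described above.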
In spite of this asymmetry, students, you have noticed that the proportion of data that lies within one standard deviation of the mean is 68 percent, the proportion that lies within two standard deviations is 94 percent, and the proportion that lies within three standard deviations is 100 percent. And isn't it interesting that these proportions are remarkably close to the proportions that I mentioned a short while ago as the empirical rule for the case of symmetric distributions? The point is that a slight departure from symmetry does not matter, and this empirical rule gives us a very good way of judging the amounts of data that lie within specified intervals of our data set. Now let us talk once more about Chebyshev's inequality. You will remember that according to this theorem, at least 75 percent of the data values will fall between x bar minus 2s and x bar plus 2s for any data set. In this data set, 94 percent of the values fall in this interval. So there is no contradiction: 94 percent is obviously much greater than 75 percent. Similarly, according to Chebyshev's inequality, at least 89 percent of the data values should fall between x bar minus 3s and x bar plus 3s, and in this data set 100 percent of the values fall in this interval. So again there is no contradiction. But you will realize that because this particular data set is approximately symmetric, or rather, I should say, slightly positively skewed but not extremely skewed, the information that we get from the empirical rule in this situation is more than what we get from Chebyshev's inequality. But I will repeat once more that for those data sets in which you do not find approximate symmetry, Chebyshev's inequality is a useful tool for establishing the proportions of the data that lie in specified intervals. Let us state this theorem one more time.
As you now see on the screen, Chebyshev's theorem states that given a set of n observations x1, x2 and so on up to xn on the variable X, the probability is at least 1 minus 1 over k squared that X will take on a value within k standard deviations of the mean of the set of observations, where k is greater than 1. There is one more thing to note here. You saw that this time I stated the theorem in this way: the probability is at least 1 minus 1 over k squared that such and such will happen. That is, Chebyshev's theorem can also be defined and stated in probabilistic terms. After the fifteenth lecture, when we begin probability theory, we will define this theorem in probabilistic terms as well. For the time being, concentrate on the interpretation that I have given you today. Let us consolidate this idea by considering one more example. Suppose that a set of data has a mean of 150 and a standard deviation of 25. Multiplying 25 by 2 we get 50, and hence we can say that we expect at least 75 percent of the data values to lie between 150 minus 50, that is 100, and 150 plus 50, that is 200. By similar calculations we find that we can expect at least 89 percent of the values to lie between 75 and 225, and at least 96 percent to lie between 25 and 275. That was our first data set. Now suppose we have another data set whose mean is exactly the same as before, that is 150, but whose standard deviation is not 25 but 10. Applying Chebyshev's theorem to this set of data, we can expect at least 75 percent of the values to lie between 130 and 170, at least 89 percent to lie between 120 and 180, and at least 96 percent to lie between 100 and 200. The figure of 96 percent is obtained by substituting the value k equal to 5 in the expression 1 minus 1 over k squared.
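The intervals for the two data sets can be reproduced with a few lines of Python. This is a minimal sketch; the function name `chebyshev_interval` is hypothetical and chosen only for this illustration.

```python
def chebyshev_interval(x_bar, s, k):
    """Return (low, high, bound) for the k-standard-deviation interval.

    bound is the Chebyshev minimum proportion 1 - 1/k^2 (meaningful for k > 1).
    """
    bound = 1 - 1 / k**2
    return x_bar - k * s, x_bar + k * s, bound


# Same mean (150), different standard deviations (25 and 10).
for label, s in (("first data set", 25), ("second data set", 10)):
    for k in (2, 3, 5):
        low, high, bound = chebyshev_interval(150, s, k)
        print(f"{label}, k = {k}: at least {bound:.0%} between {low} and {high}")
```

Running the loop for both standard deviations side by side shows how the same guaranteed proportion is captured by a narrower interval when s is smaller.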
Students, if we compare these two data sets, that will enable us to understand the difference between them with respect to dispersion. As you now see on the screen, the comparison is that at least 75 percent of the data values of data set number 1 lie between 100 and 200, whereas at least 75 percent of the data values of data set number 2 lie between 130 and 170. Now try to visualize these two intervals: 100 to 200 for the first data set and 130 to 170 for the second. You find that the interval for the second data set is narrower than the interval for the first. According to Chebyshev's theorem, we are saying that at least 75 percent of the data values lie in a narrower interval for the data set whose standard deviation was only 10, compared with the interval that we have for the data set whose standard deviation was 25, which is much more than 10. I hope that you have understood the point. If your standard deviation is a smaller number, your interval is going to be narrower, and if your standard deviation is bigger, your interval will naturally be wider. The graph that you now see on the screen illustrates this point for the case of a symmetric distribution. If the mean value is 150 for both distributions, but the standard deviation for one distribution is much more than the standard deviation for the other, then the interval for the first one that contains a certain proportion of the values is much longer than the corresponding interval for the second one.
So we have discussed the standard deviation in great detail, from two points of view: according to Chebyshev's rule, and according to the empirical rule, which is valid only when our distribution is hump-shaped and approximately symmetric. The gist from both points of view is the same: if the standard deviation is small, the interval containing a certain proportion of the data values will be smaller than it would be if the standard deviation were large. I would like to encourage you to study this particular concept in further detail. The more you study it, and the more questions you attempt and practice, students, the more you will feel at home with this concept, and, most importantly, you will be able to understand much better the role and the significance of the standard deviation as a measure of dispersion. By the way, I must have used the word dispersion twenty, twenty-five, perhaps fifty times. Have you considered why all these measures are called measures of dispersion? The answer is very simple. Obviously, the word dispersion comes from the word disperse, and when you say that something should disperse, you mean that it should spread out and move away from here, and this brings in the concept of spread. And after all, what is a measure of dispersion? It is a measure of the spread, or scatter, of your distribution. Students, we have now talked about dispersion in detail. The next concept that I am going to discuss with you is called the five number summary. Now that we have discussed and understood three very important things about any distribution, the central tendency of our data set, the spread of our data set and the shape of our data set, we are in a position to summarize the important features of our data set through what is called the five number summary.
The concept of the five number summary of a data set comes under the broad topic of exploratory data analysis. As you now see on the screen, the five number summary of any data set consists of x naught, Q1, the median, Q3 and x m. Students, you will have noticed that in this five number summary we have involved both extreme values, that is x naught and x m, and all three quartiles, that is Q1, Q3 and the median. Of course, you will recall that the median and the second quartile are one and the same thing. Students, the positions of these three quartiles relative to each other and relative to the end points enable us to determine the skewness of our distribution. To understand this, let us consider two or three points step by step. If our data set is perfectly symmetrical, the following would be true. Number one, the distance from Q1 to the median would be exactly equal to the distance from the median to the third quartile. As you now see on the screen, the distances on both sides of the median are exactly equal. Secondly, the distance from x naught to Q1 would be exactly equal to the distance from Q3 to x m. As you now see on the screen, the distances on the right side and the left side are exactly the same. And the third point to note is that in the case of an absolutely symmetric distribution, the median, the mid-quartile range and the mid-range are all equal to each other and also exactly equal to the arithmetic mean, as you now see on the screen. On the other hand, for non-symmetric distributions the rules that I just mentioned are no longer valid. As you now see on the screen, in the case of right-skewed distributions, the distance from the third quartile to x m greatly exceeds the distance from x naught to Q1. Also, in the case of a right-skewed distribution, the median is less than the mid-quartile range and the mid-quartile range is less than the mid-range.
And if we talk about the left-skewed distribution, meaning the negatively skewed distribution, the situation will be just the opposite of the one that I just mentioned for the right-skewed distribution. So, students, this is the reason why the five number summary, which includes the minimum value x naught, the maximum value x m and the three quartiles Q1, Q2 and Q3, enables us to determine the skewness of our distribution and also its spread, because after all you do remember that the distance from x naught to x m is the range, and the range is the first and foremost measure of dispersion. Let me explain this concept to you with the help of an example. Suppose that a study is being conducted regarding the annual costs incurred by students attending public versus private colleges and universities in the United States of America. In particular, suppose that for exploratory purposes our sample consists of ten universities whose athletic programs are members of the Big Ten Conference. The annual costs incurred for tuition fees, room and board at the ten schools belonging to the Big Ten Conference are: for Indiana University 15.6 thousand dollars, for Michigan State 17.0, for Ohio State University 15.2 thousand dollars and so on. In order to state the five number summary of this particular data set, students, the first step is to arrange these values in ascending order. So, as you now see on the screen, when we arrange these values the smallest value comes out to be 13.0 and the largest 23.1. Also, when we do the relevant computations, we find that the median for this data set is 15.3 thousand dollars, the first quartile is 14.9 thousand dollars and the third quartile comes out to be 16.4 thousand dollars. Therefore, the five number summary of this particular data set comes out to be x naught equal to 13.0, Q1 equal to 14.9, the median x tilde equal to 15.3, Q3 equal to 16.4 and x m equal to 23.1 thousand dollars.
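The rules stated earlier for skewed distributions can be applied mechanically to these five numbers. The Python sketch below is one possible way to do so, under two labelled assumptions: the mid-quartile range is taken to be the average of Q1 and Q3, and the mid-range the average of x naught and x m. The function name is hypothetical, and the simple greater-than test stands in for the lecture's "greatly exceeds".

```python
def skewness_from_five_numbers(x0, q1, median, q3, xm):
    """Classify skewness from a five number summary using rules of thumb.

    Assumptions: mid-quartile range = (q1 + q3) / 2,
                 mid-range          = (x0 + xm) / 2.
    """
    midquartile = (q1 + q3) / 2
    midrange = (x0 + xm) / 2
    # Right skew: long upper tail, and median < mid-quartile range < mid-range.
    if (xm - q3) > (q1 - x0) and median < midquartile < midrange:
        return "right-skewed"
    # Left skew: the mirror image of the conditions above.
    if (xm - q3) < (q1 - x0) and median > midquartile > midrange:
        return "left-skewed"
    return "inconclusive"


# Five number summary of the Big Ten costs (thousands of dollars).
print(skewness_from_five_numbers(13.0, 14.9, 15.3, 16.4, 23.1))
```

For this summary the upper tail (16.4 to 23.1) is far longer than the lower tail (13.0 to 14.9), which is the pattern the rules associate with positive skew.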
If I apply the rules that I mentioned a short while ago, students, we find that the distribution of this particular data set is positively skewed. So you find that the five number summary is a simple and yet very effective way of determining the shape of your distribution without actually drawing its graph. This brings us to the end of today's lecture. Next time I will continue with the concept of the five number summary, and I will proceed to another very interesting concept called the box and whisker plot. Until next time, I wish you the best in your studies of the subject, and I would like to encourage you to attempt as many exercise questions as you can. Best of luck and Allah Hafiz.