As-Salam-Alaikum. Welcome to lecture number 13 of the course on statistics and probability. Students, you will remember that in the last lecture I discussed some very interesting and important concepts with you. In particular, you will remember Chebyshev's inequality, which gives you a certain minimum proportion of your data lying within specified intervals. Also, you will remember the empirical rule, which is valid in the case of unimodal, hump-shaped, approximately symmetric distributions. Also, towards the end of the last lecture, I discussed with you the five number summary. Today, I will discuss with you another interesting concept, the box and whisker plot, which is directly associated with the five number summary. So first I would like to revise with you the concept of the five number summary. For any data set, when we have acquired information regarding the central tendency, the dispersion and the shape of the distribution, we would like to identify and describe all these major characteristics of the data in a summarized format, and the five number summary is a useful device for this purpose. As you now see on the screen, a five number summary for any data set consists of the five quantities x naught, Q1, the median x tilde, Q3 and x m, where x naught is the smallest value and x m is the largest. As I told you last time, for a perfectly symmetrical distribution, the distance between Q1 and x tilde is exactly the same as the distance between x tilde and Q3, and you can now see this on the screen. Also, for an absolutely symmetric distribution, the distance between x naught and Q1 is exactly the same as the distance between Q3 and x m. Another important point is that the arithmetic mean, the median, the mid range and the mid quartile range are all equal to the central point of our distribution. And what is the situation in the case of a positively skewed distribution? As you can see on the screen, in this situation, the distance between Q3 and x m is greater than the distance between x naught and Q1.
Also, in the case of a right skewed distribution, the median is less than the mid quartile range and the mid quartile range is less than the mid range. Students, you must have noted in the last lecture or in today's lecture that I am using the term right skewed distribution for the positively skewed distribution, and I use the term left skewed distribution for the negatively skewed distribution. When I say right skewed, I mean that the right side tail is longer than the other tail. In the case of the left skewed distribution, the distance between x naught and Q1 exceeds the distance between Q3 and x m, as you now see on the screen. Also, in left skewed distributions, the mid range is less than the mid quartile range and the mid quartile range is less than the median. Students, all of these features are reflected in the five number summary. You will remember that in the last lecture I applied these concepts to an example, and I will revise it with you now. As you see on the screen, suppose we have data regarding the annual costs incurred by students attending public versus private colleges and universities in the United States of America. In particular, suppose that our sample consists of ten universities whose athletic programs are members of the Big Ten Conference, and the annual costs incurred for tuition fee, room and board at these ten schools are given as follows: for Indiana University 15.6 thousand dollars, for Michigan State 17.0 thousand dollars, and so on. As I said last time, if we want to develop the five number summary from this data, then the first step is to arrange the data in ascending order, and as you now see on the screen, the ordered array for this particular data set is 13.0, 14.3, 14.9 and so on, and the last value is 23.1 thousand dollars.
The first advantage of the ordered array is that we are able to locate x naught, the smallest value, and x m, the largest value, very easily. Also, the ordered array enables us to compute the median, the first quartile and the third quartile, and for this particular data set the median comes out to be 15.30 thousand dollars, Q1 is equal to 14.90 thousand dollars and Q3 comes out to be 16.40 thousand dollars. Therefore, the five number summary for this particular data set is 13.0, 14.9, 15.3, 16.4 and 23.1. These five numbers represent x naught, Q1, x tilde, Q3 and x m, in that order. Now, if we want to use this five number summary to understand the shape of this data or its salient features, then according to the rules that I gave you earlier, you will agree with me that the distribution of the annual cost data is positively skewed. I have come to this conclusion because of two reasons. Number one, the distance from Q3 to x m, that is 6.7, greatly exceeds the distance from x naught to Q1, that is 1.9. The second point is that if we compare the median, which is 15.3, the mid quartile range, which is 15.65 for this particular data set, and the mid range, which is 18.05, we observe that the median is less than the mid quartile range and the mid quartile range is less than the mid range. So, as I said last time, the five number summary is a simple yet effective way of describing the salient features of your data set and of determining the direction of skewness without actually having to draw the histogram or the frequency polygon of your distribution. Students, we did the five number summary last time and today we have revised it in detail again, and the reason for this is that the next concept which I will discuss with you is directly linked with the five number summary. This next concept, as I said earlier, is a very interesting diagram, and it is called the box and whisker plot.
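The two checks just described can be written out as a short Python sketch. This is only an illustration and not part of the lecture material: the function name `summarize_shape` and the label "mixed evidence" are assumptions introduced here, and the input is the five number summary quoted above for the ten universities.

```python
def summarize_shape(x0, q1, median, q3, xm):
    """Apply the lecture's five-number-summary rules for skewness.

    The function name and the returned labels are illustrative choices.
    """
    midrange = (x0 + xm) / 2      # average of smallest and largest values
    midquartile = (q1 + q3) / 2   # called "mid quartile range" in the lecture
    lower_tail = q1 - x0          # distance from x naught to Q1
    upper_tail = xm - q3          # distance from Q3 to x m

    if upper_tail > lower_tail and median < midquartile < midrange:
        shape = "right skewed"
    elif lower_tail > upper_tail and midrange < midquartile < median:
        shape = "left skewed"
    elif lower_tail == upper_tail and median == midquartile == midrange:
        shape = "symmetric"
    else:
        shape = "mixed evidence"  # the two rules disagree

    return {
        "midrange": midrange,
        "midquartile": midquartile,
        "lower_tail": lower_tail,
        "upper_tail": upper_tail,
        "shape": shape,
    }


# Five number summary of the annual cost data (thousand dollars).
result = summarize_shape(13.0, 14.9, 15.3, 16.4, 23.1)
print(result["shape"])
```

Note that the symmetric branch uses exact equality, which real data will rarely satisfy; in practice one would compare within some tolerance.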
As you now see on the screen, in its simplest form a box and whisker plot provides a graphical representation of the data through its five number summary. The plot consists of a box which is partitioned inside by a vertical line, and the plot also contains two horizontal lines, one on the left side and one on the right, and these are called whiskers. So, the box and the whiskers together constitute the box and whisker plot. So, how do we construct this plot? I will answer this question step by step. The first step, as you now see on the screen, is that the variable of interest, that is the X variable, is represented on the horizontal axis. Next, a box is drawn in the space above the horizontal axis in such a way that the left end of the box aligns with the first quartile of our data set and the right end of the box aligns with the third quartile. After that, the box is divided into two parts by a vertical line that aligns with the median. Next, a line, that is a whisker, is extended from the left end of the box to a point that aligns with x naught, the smallest measurement in our data set. And last but not least, another line, that is the other whisker, is extended from the right end of the box to a point that aligns with x m, the largest measurement in the data set. So, in this manner we get this very interesting plot, a box and two whiskers. Let me explain this point to you with the help of an example. Suppose that we have data regarding the downtime in hours recorded for 30 machines which are owned by a large manufacturing company, and it is known that the period of time covered was the same for all machines. The data regarding the downtime of the 30 machines is 4, 6, 1, 8, 1, 4 and so on. In order to construct the box and whisker plot, we need to locate the smallest and the largest observations in our data set, and these values are 1 and 13 respectively. Also, we need to compute the first quartile, the median and the third quartile, and for this particular data set these values come out to be 4, 5 and 8.25.
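To make the construction steps concrete, here is a minimal Python sketch that draws the plot as a single line of text from the five values just quoted for the machine downtime data. The function `ascii_boxplot`, its `width` parameter and the characters used for the box and whiskers are all assumptions chosen for illustration; a real plot would normally come from a charting library.

```python
def ascii_boxplot(x0, q1, median, q3, xm, width=49):
    """Draw a one-line text box and whisker plot (requires xm > x0)."""

    def col(value):
        # Map a data value to a column on the horizontal axis.
        return round((value - x0) * (width - 1) / (xm - x0))

    cells = [" "] * width
    # Whiskers: a line running from the smallest to the largest value.
    for i in range(col(x0), col(xm) + 1):
        cells[i] = "-"
    # Box: drawn over the whisker line from Q1 to Q3.
    for i in range(col(q1), col(q3) + 1):
        cells[i] = "="
    cells[col(q1)] = "["      # left end of the box at Q1
    cells[col(q3)] = "]"      # right end of the box at Q3
    cells[col(median)] = "|"  # vertical line at the median
    return "".join(cells)


# Machine downtime example: x naught = 1, Q1 = 4, median = 5, Q3 = 8.25, x m = 13.
plot = ascii_boxplot(1, 4, 5, 8.25, 13)
print(plot)
```

In the printed line the `|` sits nearer the `[` than the `]`, and the run of dashes on the left is shorter than the run on the right.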
As such we obtain a box and whisker plot which you now see on the screen. Students, how did I compute these three quartiles, Q1, Q3 and x tilde? You will remember that when I was discussing quartiles with you, I gave you in detail the formulae that are valid in the case of the frequency distribution of a continuous variable. But I did not discuss with you in much detail how you would find the quartiles in the case of raw data. I hope, however, that you have already studied quite a bit of material from your own textbook and from other books, and that you do have an idea as to how to compute the quartiles in the case of raw data. In this particular example, as you now see on the screen, the first quartile is the (30 plus 1) over 4th, that is the 7.75th, ordered measurement, and it comes out to be equal to 4. Similarly, the median is the (30 plus 1) over 2th, that is the 15.5th, ordered measurement, and that comes out to be equal to 5. In a similar way, the third quartile is the 3 times (30 plus 1) over 4th, that is the 23.25th, ordered measurement, and that comes out to be equal to 8.25. Students, the most important question is: how will we interpret this box and whisker plot? The point to understand is that this interesting and simple diagram gives us quite a lot of information. It gives us information regarding the spread, the location of concentration of the data values and the shape of your distribution. In the example that we have just considered, the box and whisker plot reveals that 50 percent of the measurements fall between 4 and 8.25. Also, the box and whisker plot indicates clearly that the median is 5 and the range is 12. And the most important point is as regards the skewness of the distribution: since the median line is closer to the left end of the box, the data are skewed to the right. So, as I said, if the median line is closer to the left end of the box, it means that the data are skewed to the right.
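The quartile positions quoted above for raw data can be reproduced with a small Python sketch. The lecture lists only the first few of the 30 downtime values, so the sketch checks the positions for n = 30 and then uses a short hypothetical data set to show the lookup. Linear interpolation between neighbouring ordered values is an assumption made here (textbooks differ on how fractional positions are handled), and the function names are illustrative.

```python
def quartile_positions(n):
    """Positions of Q1, the median and Q3 in an ordered array of n values."""
    return ((n + 1) / 4, (n + 1) / 2, 3 * (n + 1) / 4)


def ordered_value(sorted_data, position):
    """Value at a 1-based, possibly fractional, position.

    Fractional positions are resolved by linear interpolation between
    the two neighbouring ordered values (an assumed convention).
    """
    lower = int(position)          # whole part of the position
    fraction = position - lower    # fractional part
    below = sorted_data[lower - 1]
    if fraction == 0:
        return below
    above = sorted_data[lower]
    return below + fraction * (above - below)


def quartiles(data):
    """Q1, median and Q3 of raw data (needs at least 3 values)."""
    ordered = sorted(data)
    return tuple(ordered_value(ordered, p)
                 for p in quartile_positions(len(ordered)))


print(quartile_positions(30))             # positions for the 30 machines
hypothetical = [7, 1, 5, 3, 9, 11, 13]    # made-up data, not from the lecture
print(quartiles(hypothetical))
```

For comparison, Python's standard library offers `statistics.quantiles(data, n=4, method="exclusive")`, which is also based on (n + 1) positions.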
If the median line is closer to the right end of the box, the distribution will be skewed to the left. And of course, for a perfectly symmetrical distribution the median line will be neither closer to the left nor closer to the right, and it will be in the exact center of the box. So this is one point that you should note: is the median line closer to the left, is it closer to the right, or is it exactly in the middle of the box? The second, equally important point is that you should look at the whiskers carefully and note whether the left whisker is exactly equal to the right one or whether one of them is shorter than the other. And there is a very interesting relationship here between which whisker is shorter and which is longer, and whether the median line is closer to the left side of the box or to the right side. So, as you now see on the screen, for a negatively skewed distribution the median line is closer to the right end of the box and the whisker to the right is shorter than the whisker to the left. On the other hand, for a right skewed distribution the median line is closer to the left end of the box and the whisker to the left is shorter than the whisker to the right. Let us consolidate all these ideas by going back to the example of the ten universities. As you will recall, the annual costs incurred on tuition fee, room and board for the ten universities were available, and the five number summary of this data set came out to be x naught equal to 13.0, Q1 14.9, x tilde 15.3 and so on. So, as you now see on the screen, the box and whisker plot for this particular data set is such that the median line is closer to the left end of the box and the left whisker is much shorter than the one on the right.
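The two reading rules above (position of the median line inside the box, and relative length of the whiskers) can be sketched in Python as follows. The function `read_boxplot` and its return labels are assumptions for illustration; the inputs are the five number summaries quoted in the lecture for the universities and for the machine downtime data.

```python
def read_boxplot(x0, q1, median, q3, xm):
    """Report the skew direction suggested by each box plot rule."""

    def skew_direction(left_length, right_length):
        # The shorter side marks where the data are concentrated,
        # so the skew points toward the longer side.
        if left_length < right_length:
            return "right"
        if left_length > right_length:
            return "left"
        return "none"

    return {
        # Median line closer to the left end of the box -> skewed right.
        "median_line": skew_direction(median - q1, q3 - median),
        # Left whisker shorter than the right whisker -> skewed right.
        "whiskers": skew_direction(q1 - x0, xm - q3),
    }


universities = read_boxplot(13.0, 14.9, 15.3, 16.4, 23.1)
machines = read_boxplot(1, 4, 5, 8.25, 13)
print(universities)
print(machines)
```

Both rules point to the right for both data sets, which agrees with the conclusion drawn from the plots.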
So, according to the rules that I conveyed to you a short while ago, it is obvious from this box and whisker plot that the data of the annual costs of the students of these universities are positively skewed. I will encourage you, since this is a very small data set, to try to draw its frequency polygon or its frequency curve and see whether it looks like a positively skewed distribution, as indicated by the box and whisker plot. So the gist of this whole discussion is that the direction of skewness is opposite to the side toward which the median line lies. And the second point is that the direction of skewness is the same as the side on which the whisker is longer. Students, the box and whisker plot comes under the realm of exploratory data analysis, or EDA as it is called, and this is a relatively new area in statistics. The diagrams that you will now see on the screen show a comparison between the box and whisker plot and the more traditional method of drawing the frequency curve. The first diagram shows the situation where our distribution is absolutely symmetric, and as you see, in this case the box and whisker plot is also absolutely symmetric. When I say symmetric, I mean that if you place a mirror vertically in the center of the box and whisker plot, you find that the left side of the plot is the mirror image of the right side. The lower diagram on the screen shows the situation of the rectangular distribution, which again is one in which the left side is the mirror image of the right side. Although the rectangular distribution is not encountered as frequently as the hump-shaped distribution, it is also a very useful distribution for describing the concept of absolute symmetry.
Students, it is very unlikely that in a real life situation you will collect data whose distribution is absolutely symmetrical, but in many situations we do get data which are approximately symmetrical, and the box and whisker plot will enable you to judge that very easily. If your median line is almost in the middle of the box and the left whisker is almost as long as the right one, you can immediately say that this particular data set is approximately symmetrical. As stated earlier, in the case of the negatively skewed distribution the median line is closer to the right end of the box and the right whisker is shorter in length than the left one. On the other hand, in the case of a positively skewed distribution the median line is closer to the left end of the box and the left whisker is shorter in length than the right one. Now, we can also explain a positively skewed distribution in this way: the concentration of the data points is towards the low end of the scale. Similarly, on the opposite side, for a left skewed distribution we will say that the concentration of the data points is towards the upper end of the scale. In the universities example the situation was something like this: approximately 75 percent of the values were concentrated on the lower end of the scale of the annual cost of the students, and the remaining approximately 25 percent were dispersed along the long right whisker of the box and whisker plot. Students, I have discussed with you the concept of the five number summary and the box and whisker plot in a lot of detail. I hope that you will study these concepts in depth and practice with a lot of questions. The next concept that I would like to pick up now is Pearson's coefficient of skewness.
In this regard, the first point to note is that we may think that by providing information regarding the center and the spread of the distribution, that is by computing the mean and the standard deviation, we have given a perfectly adequate description of the data. But in reality, two distributions which have exactly the same mean and exactly the same standard deviation may be quite dissimilar. Let me explain this point to you with the help of an example. Suppose that we have data regarding the age of onset of nervous asthma in children, and we have this information regarding two categories of children, children of manual workers and children of non-manual workers. As you see on the screen, the data regarding children of manual workers has frequencies 3, 9, 18, 18, 9 and 3 corresponding to the age groups 0 to 2, 3 to 5, 6 to 8, 9 to 11, 12 to 14 and 15 to 17 years. On the other hand, the frequencies for the children of the non-manual workers are 3, 12, 9, 27, 6 and 3. Although the total frequency is 60 for both distributions, I hope you realize by looking at those frequencies that the two distributions are quite dissimilar. If we want to compute the mean and the standard deviation, then we will do all those calculations which I conveyed to you earlier, and as you now see on the screen, the calculations for the children of manual workers and those for the children of non-manual workers provide all the sums that we require in order to substitute in the formulae of the mean and the standard deviation. The interesting point is that for both of these distributions the mean age comes out to be 8.5 years and the standard deviation 3.61 years. Yet the shape of the two distributions will be quite different, because the pattern of the frequencies for the two categories of children is quite different.
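The claim that both frequency distributions share the same mean and standard deviation can be checked with a short Python sketch. Two assumptions are made here that the lecture does not state explicitly: each age group is represented by its class midpoint (1, 4, 7, 10, 13 and 16 years), and the standard deviation uses the divisor n rather than n minus 1. The function name is illustrative.

```python
from math import sqrt

# Assumed class midpoints for the groups 0-2, 3-5, 6-8, 9-11, 12-14, 15-17.
MIDPOINTS = [1, 4, 7, 10, 13, 16]
MANUAL = [3, 9, 18, 18, 9, 3]        # children of manual workers
NON_MANUAL = [3, 12, 9, 27, 6, 3]    # children of non-manual workers


def grouped_mean_sd(midpoints, freqs):
    """Mean and standard deviation of a grouped frequency distribution."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(midpoints, freqs)) / n
    variance = sum(f * (x - mean) ** 2 for x, f in zip(midpoints, freqs)) / n
    return mean, sqrt(variance)


manual_mean, manual_sd = grouped_mean_sd(MIDPOINTS, MANUAL)
non_manual_mean, non_manual_sd = grouped_mean_sd(MIDPOINTS, NON_MANUAL)
print(manual_mean, round(manual_sd, 2))
print(non_manual_mean, round(non_manual_sd, 2))
```

Under these assumptions both groups give a mean of 8.5 years and a standard deviation of about 3.61 years, matching the figures quoted above.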
So, as you now see on the screen, the distribution for the children of the manual workers is absolutely symmetric, whereas the distribution for the children of non-manual workers departs considerably from symmetry. So I hope that this point is clear: two distributions can have exactly the same arithmetic mean and the same standard deviation and yet be quite different in terms of skewness. Pearson's coefficient of skewness is one method of measuring the skewness present in a data set, and as you now see on the screen, Pearson's coefficient is given by mean minus mode over standard deviation. If we apply the empirical relation between the mean, the median and the mode, Pearson's coefficient of skewness becomes 3 times (mean minus median), divided by the standard deviation. As indicated earlier, in the case of a positively skewed distribution the mean is greater than the median, and hence Pearson's coefficient will come out to be positive. In the case of a negatively skewed distribution the median is greater than the mean, and Pearson's coefficient comes out to be a negative quantity. And in the case of an absolutely symmetric distribution, students, you already know that the mean and the median coincide, and hence Pearson's coefficient will be exactly equal to 0. Let us apply this concept to the example that we just considered. As you now see on the screen, for the children of the manual workers the mean and the median are both equal to 8.50, and hence Pearson's coefficient is exactly 0, whereas for the children of the non-manual workers the mean is equal to 8.50 and the median is 9.16, and hence Pearson's coefficient comes out to be minus 0.55.
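As a rough check of these figures, the following Python sketch computes the median of each grouped distribution and then Pearson's coefficient in its 3 times (mean minus median) over standard deviation form. It assumes the usual interpolation formula for the median of grouped data with class boundaries half a year below each lower class limit (for example 8.5 to 11.5 for the group 9 to 11); that assumption is not stated in the lecture, but it gives a median of about 9.17 for the non-manual group, close to the 9.16 quoted. The function names are illustrative, and the mean of 8.50 and standard deviation of 3.61 are taken from the lecture.

```python
LOWER_LIMITS = [0, 3, 6, 9, 12, 15]   # lower class limits in years
CLASS_WIDTH = 3
MANUAL = [3, 9, 18, 18, 9, 3]
NON_MANUAL = [3, 12, 9, 27, 6, 3]


def grouped_median(lower_limits, width, freqs):
    """Median of a grouped distribution by linear interpolation."""
    half = sum(freqs) / 2
    cumulative = 0
    for lower, f in zip(lower_limits, freqs):
        if cumulative + f >= half:
            boundary = lower - 0.5   # assumed lower class boundary
            return boundary + width * (half - cumulative) / f
        cumulative += f
    raise ValueError("empty frequency distribution")


def pearson_skewness(mean, median, sd):
    """Pearson's coefficient: 3 * (mean - median) / standard deviation."""
    return 3 * (mean - median) / sd


manual_median = grouped_median(LOWER_LIMITS, CLASS_WIDTH, MANUAL)
non_manual_median = grouped_median(LOWER_LIMITS, CLASS_WIDTH, NON_MANUAL)

# Mean 8.50 and standard deviation 3.61 as quoted in the lecture.
print(pearson_skewness(8.50, manual_median, 3.61))
print(round(pearson_skewness(8.50, non_manual_median, 3.61), 2))
```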
The negative answer in the case of children of non-manual workers indicates that the distribution for that category of children is negatively skewed, and I would like to encourage you, students, to look again at the frequency polygons of the two categories of children and to compare the results that you have just obtained with what you see in that graph. Students, in today's lecture we discussed various ways of ascertaining the skewness in our data set, and the last thing that I discussed is Pearson's coefficient of skewness. Next time I will begin with another very interesting measure, which is called Bowley's coefficient of skewness. In the meantime, I hope that you will enjoy revising and practicing the various concepts that you learnt today. Until next time, Allah Hafiz.