Assalamu alaikum, and welcome to lecture number 6 of the course on statistics and probability. Students, you will recall that in the last lecture and in the one before it we dealt with a very important concept, that of the frequency distribution of a continuous variable. In today's lecture I will begin with a diagram known as the stem and leaf plot. This plot was introduced by the famous statistician John Tukey in 1977. What was the motivation for this particular diagram? The motivation was that when you form a frequency distribution, the identity of the individual observations is lost. The stem and leaf display is a diagram which offers a quick and novel way of simultaneously sorting and displaying a data set. It is a diagram in which each item in the data set is divided into two parts, a stem and a leaf. The stem consists of the leading digit or digits of the number, the remaining trailing digit or digits form the leaf, and a vertical line separates the stem from the leaf. Consider the number 243. Since there are 3 digits in the number, you can break it in two ways. If you treat 2 as the leading digit, then 4 and 3, that is 43, form the trailing digits. And if you treat 24 as the leading digits, then 3 forms the leaf. How will we construct a stem and leaf display for an entire data set and not just for one number? Let us consider the following example. Suppose that we have the ages of 30 patients admitted to a certain hospital during a certain week, and suppose that the ages are as you now see on the screen. If we were to represent this data in the form of a stem and leaf display, obviously what we should do is to put the first digit of every one of these numbers in the stem part and the second digit in the leaf part. If we do that, we obtain the stem and leaf diagram that you now see.
The very first stem consists of the digit 1 and the leaves corresponding to that stem are 8 and 2. The reason is that in our data set there was one patient whose age was 18 and another one whose age was only 12. Similarly, you can read off all the other stems and their corresponding leaves. What you have just seen is, of course, very much a stem and leaf display. But in it we entered all the values in exactly the order in which they occurred in our original data. Generally, we prefer to first arrange the data in ascending order and then represent it in the form of a stem and leaf plot. If we do that for this particular data set, our array will be as you now see. Our values in ascending order are 12, 18, 26, 27, 29 and so on. Now if we convert this arranged data into a stem and leaf display, exactly as we did a moment ago, our stem and leaf display will look like this. So as you can see, the stem and leaf display is quite a useful way of representing data, one in which the identity of your observations is not lost. Every single observation is in front of you and yet it has been presented in a compact, presentable and beautiful form. Of course, this can also be converted into a frequency table, and in this particular data set, as you can now see on the screen, the frequency of the class 10 to 19 is 2, the frequency of the class 20 to 29 is 3 and the frequency of the class 30 to 39 is 5. How did I obtain this? Look, the first stem was 1 and its corresponding leaves are 2 and 8. This means that the ages of those 2 patients are 12 and 18 years. For both of these ages, note that the values fall between 10 and 19, because the numbers from 10 to 19 are exactly those whose first digit has to be 1. So 12 and 18 obviously fall in the class whose first digit is 1.
Similarly, our second stem is 2 and the leaves against it are 6, 7 and 9. This means that the ages of these 3 patients are 26, 27 and 29. Obviously, these 3 ages fall in the class 20 to 29. So, students, in this way it is actually quite simple to interpret the stem and leaf display and also to convert it into a frequency distribution. In this manner, for this particular example we have the frequency distribution that you now see on the screen. Our class limits are 10 to 19, 20 to 29 and so on. The class boundaries are found as we did in the examples of the previous lectures: by taking the average of 19 and 20 we get 19.5, the average of 29 and 30 gives us 29.5 and so on. After that, of course, when we count the number of observations that fall in each of these classes, our frequencies come out to be 2, 3, 5, 6, 6, 6 and 2. Converting this frequency distribution into a histogram we obtain what you now see, and I think you will all agree, following what I said in the last lecture and the lecture before that, that this histogram is approximately symmetric or, better said, slightly negatively skewed. If I rotate this histogram by 90 degrees I will obtain what you now see on the screen. But students, what did our stem and leaf display look like? Did it not look something like what you now see? So you see, the stem and leaf display looks just like the rotated histogram. And this is the point: you have summarized your data set and organized it, and its look is the same as that of your histogram or your frequency distribution, but the advantage is that whereas in the frequency distribution you lost the identity of the individual observations, here you have all the observations in front of you. Let us consider another example. Students, please note that in the previous example we had a data set containing values which were all two digit numbers.
This particular example will explain the situation where some of the data values are three digit numbers. The example reads: listed in the following table is the number of 30 second radio advertising spots purchased by each of the 45 members of one particular automobile dealers association in one particular country. As you can see, the numbers of advertising spots purchased are 96, 93, 88, 117 and so on. We would like to organize this data into a stem and leaf display. We would also like to obtain answers to the following questions. Number one, around what values does the number of advertising spots tend to cluster? Number two, what is the smallest number of spots purchased by a dealer? And number three, what is the largest number purchased? Now in order to solve this question, students, the first step is to note that the smallest value in this particular data set is 88, and so we will take the first stem value as 8. Also, since the largest number of spots purchased is 156, we will have the stem values going up to 15. As you now see on the screen, the stem consists of the numbers 8, 9, 10, 11, 12, 13, 14 and 15. Next we will consider every data value one by one and start filling out our stem and leaf plot. The first number in our data set is 96, which means that the stem value of this number is 9 and the leaf value is 6. Similarly, the second value in the data set is 93, so the stem value is 9 and the leaf value is 3. The third value in our data set is 88, so the stem value is 8 and the leaf value is also 8. In this manner, after having entered the first three data values in our stem and leaf plot, we obtain what you now see on the screen. Organizing all the data values, we obtain what you now see on the slide: for the first stem value 8 the leaf values are 8 and 9, for the second stem value 9 we have the leaf values 6, 3, 5, 6, 4, 4 and 7, and similarly we have the entire display.
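The filling-in procedure just described can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, the last digit of each value is assumed to be the leaf, and only the first four values in the list (96, 93, 88, 117) are quoted in the lecture. The remaining values are made-up placeholders, not the actual table of 45 dealers.

```python
from collections import defaultdict


def stem_and_leaf(values, sort_leaves=True):
    """Return {stem: [leaves]} with the last digit of each value as its leaf."""
    rows = defaultdict(list)
    for value in values:
        stem, leaf = divmod(value, 10)  # e.g. 117 -> stem 11, leaf 7
        rows[stem].append(leaf)
    lowest, highest = min(rows), max(rows)
    # Include every stem between the smallest and largest, even if it has no leaves.
    return {
        stem: (sorted(rows[stem]) if sort_leaves else list(rows[stem]))
        for stem in range(lowest, highest + 1)
    }


def format_display(display):
    """Render the display with a vertical bar separating stem from leaves."""
    return "\n".join(
        f"{stem:>3} | {' '.join(str(leaf) for leaf in leaves)}"
        for stem, leaves in display.items()
    )


# First four values are from the lecture; the rest are hypothetical placeholders.
spots = [96, 93, 88, 117, 127, 95, 113, 96, 108, 94, 148, 156]

display = stem_and_leaf(spots)
print(format_display(display))

# Counting the leaves on each stem gives the matching frequency table.
frequencies = {stem: len(leaves) for stem, leaves in display.items()}
print(frequencies)
```

Passing `sort_leaves=False` keeps the leaves in the order in which the values were entered, which corresponds to the unsorted display obtained before the leaves are arranged in ascending order.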
Now the usual procedure is to sort the leaf values from the smallest to the largest, and if we do that the final table appears as you now see on the screen. As you can see, this looks very nice because not only are all the stem values in ascending order but in every row all the leaf values are in ascending order as well. Now we can draw many conclusions from this stem and leaf plot. First, the smallest number of spots purchased is 88 and the largest is 156. We also note that 2 dealers purchased fewer than 90 spots and 3 purchased 150 or more. Further, we note that the concentration of the number of spots is between 110 and 130. There were 9 dealers who purchased between 110 and 119 radio advertising spots, and 8 dealers who purchased between 120 and 129 spots. So, as I said, the concentration is in this area, and students, as far as the shape of the distribution is concerned, it is obvious from the stem and leaf display that the distribution is approximately symmetric. Let us now consider another example. Suppose we have the data regarding the mean annual death rates for a certain population for the age groups 20 to 65, as you now see on the screen. These death rates are per 1000 and the figures are 7.5, 8.2, 7.2 and so on. If I use the decimal part of each number as the leaf and the rest of the digits as the stem, I will obtain an ordered stem and leaf display as you now see on the screen. Students, I leave it to you to verify that the stem and leaf display for this particular data set comes out exactly the way you just saw, and also to study a few variations that we have for the stem and leaf display. I will now proceed to the next concept that we have to consider in the area of descriptive statistics, and that is the concept of central tendency. This is an extremely important concept in the whole theory of statistics, the concept which in common parlance is called averages.
In this context, the very first thing to note and to recall is that any data set that we are going to collect in real life is going to be essentially variable data. That is, the values that we collect are obviously not going to be equal to one another but are going to vary. So the first thing we have to realize is that we need some measures, some means by which we are able to describe the variable data that is available to us. A concise numerical description is often preferable to a lengthy tabulation, and if this form of description enables us to form a mental image of the data and to interpret its significance, so much the better. Averages enable us to measure the central tendency of variable data, and measures of dispersion enable us to measure the variability of the data. Now we define this concept of averages formally. An average is a single value which is intended to represent a set of data or a distribution as a whole. It is more or less a central value around which the observations in our data set usually tend to cluster. Since a measure of central tendency indicates the location or the general position of the frequency distribution on the x axis, it is also known as a measure of location or a measure of position. Let me try to explain my point with the help of an example. Suppose that we have data on the number of houses that have various numbers of rooms, and we have this data for two different suburbs, suburb A and suburb B. Looking at these two frequency distributions, we should ask ourselves what exactly the distinguishing feature is. If we were to draw the frequency polygons of the two distributions, we would obtain, as you can now see on the screen, two polygons which are exactly identical to each other except that their location is slightly different. You have seen how interesting it is that the shape of the polygon is exactly identical for both of the data sets but the position on the x axis is different.
Now, what is the reason for this? Let us compute the mean of these two distributions, that is, the arithmetic mean, which is the ordinary concept that we are all familiar with. If I compute the mean number of rooms per house for suburb A, I find that this number comes out to be 6.67, but if I compute the mean number of rooms per house for suburb B, that is equal to 7.67. So there is a difference of 1 between the two averages. This difference between the mean values of the two distributions is what accounts for the difference in the location of the two distributions on the x axis. Let us look at the original data once again. The frequencies for suburb A are 8, 27, 30 and 16. The frequencies for suburb B are exactly the same, but the difference is this: in suburb A, 8 corresponds to number of rooms equal to 5, 27 corresponds to number of rooms equal to 6 and so on. In suburb B the same frequency distribution occurs but with a kind of shift: 8 now corresponds to number of rooms 6, 27 corresponds to number of rooms 7 and so on. The same shift that you have seen in the data set you have also seen in the frequency polygons, and the same shift, the same difference, is reflected in the mean values of the two distributions. So in this way you have seen that a single value, the average, represents a whole distribution very conveniently. That is, if we know only the average value, we get to know where our whole distribution is located. Now, students, in connection with all this discussion regarding the average, there are two points that are very important and I would like to discuss them with you one by one. The first point is that in the example that we have just done, the variable was a discrete variable. After all, it was the number of rooms in a house that was being discussed. So we could have 5 rooms, 6 rooms or 7 rooms, but obviously we will not have 7 and a half rooms.
But you have noted that the average number of rooms per house was 6.67 for one suburb and 7.67 for the other. So now the question arises, what is the meaning of such a figure? How can we have 6.67 or 7.67 rooms per house? Students, actually it is not such a big problem. If you pay attention, you will note that we cannot have 6.67 rooms in one house, nor can we have 66.7 rooms in 10 houses, but we can have 667 rooms in 100 houses. So this is the way to interpret the arithmetic mean in the case of a discrete variable. If your average comes out in decimals, you will interpret it in this way. To repeat this example, we are saying that for suburb A, on the average every 100 houses have 667 rooms, which is equivalent to saying 6.67 rooms per house. This was the first point. The second point is that whatever average value you compute in any problem, you should interpret it with respect to the phenomenon being studied. So, for the example that we just considered, we are saying that on the average there are 6.67 rooms per house in suburb A but 7.67 rooms per house in suburb B. Students, what does this mean? It means that on the average suburb B has larger houses as compared with suburb A, to the extent that on the average there is one room more in the houses of suburb B as compared with suburb A. Let us now begin our discussion of the various types of averages that we can have. I have given you an example to convey the basic concept of averages, or the basic concept of the central tendency of a data set. In that example I repeatedly mentioned the arithmetic mean because that is the most commonly used average. But of course there are several other types of averages too, and they have their own importance and their own significance in various situations. As you can see on the slide, the most common types of averages that we have are the arithmetic mean, the geometric mean, the harmonic mean, the median and the mode.
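Before moving on to the individual averages, the suburb figures quoted above can be checked with a short Python sketch. It assumes, as the lecture indicates with "and so on", that the frequencies 8, 27, 30 and 16 go with 5, 6, 7 and 8 rooms in suburb A and with 6, 7, 8 and 9 rooms in suburb B. The function name is hypothetical.

```python
def frequency_mean(values, frequencies):
    """Arithmetic mean of a discrete frequency distribution: sum(f * x) / sum(f)."""
    total_frequency = sum(frequencies)
    weighted_sum = sum(x * f for x, f in zip(values, frequencies))
    return weighted_sum / total_frequency


frequencies = [8, 27, 30, 16]   # number of houses, same for both suburbs
rooms_a = [5, 6, 7, 8]          # assumed room counts for suburb A
rooms_b = [6, 7, 8, 9]          # suburb B: same frequencies, shifted by one room

mean_a = frequency_mean(rooms_a, frequencies)
mean_b = frequency_mean(rooms_b, frequencies)

print(round(mean_a, 2), round(mean_b, 2))  # the two means quoted in the lecture
print(round(mean_a * 100))                 # rooms per 100 houses in suburb A
```

Under these assumptions the two means differ by exactly one room, which is the same shift seen in the data and in the frequency polygons.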
The arithmetic, geometric and harmonic means are those averages which are mathematical in character and which give an indication of the magnitude of the observed values. The median indicates the middle position of the data set, while the mode provides information about the most frequent value in our data set. Rather than starting with the arithmetic mean in a more formal manner, I would like to begin this discussion of the various types of averages by talking about the mode. As I just said, the mode is that value which occurs most frequently in a set of data, that is, the value which indicates the most common result. If you consider the example of the marks of eight students in a particular test, which are 2, 7, 9, 5, 8, 9, 10 and 9, obviously the most common mark is 9. In other words, the mode of this particular small data set is 9. So let us first of all consider the case when we are dealing with the raw data, not a frequency distribution but the raw data, of a continuous variable. In this case, how do we find the mode? It is very simple. Just as in the example that I just did, all you have to do is to count the number of times each observation occurs, and the value which occurs the greatest number of times will be the mode. Let me explain this point with the help of an example. Suppose that the government of a country collected data regarding the percentages of revenues spent on research and development by 49 different companies and obtained the figures that you now see. Before I actually compute the mode, I would like to share with you another very interesting plot, which is called the dot plot. To define it formally, a dot plot is that plot in which the horizontal axis contains a scale for the quantitative variable that we want to represent, and the numerical value of each measurement in the data set is located on this horizontal scale by means of a dot.
When data values repeat, the dots are placed above one another, forming a pile at that particular numerical location. In this particular example, you have the dot plot as you now see on the screen, and as you can see, the value 6.9 occurs three times whereas all the other values occur either once or twice. Hence the modal value is 6.9. You have seen that since there were three dots at 6.9 and that pile was taller than the pile at any other value, it became very easy for us to locate the mode. The dot plot also gives you quite a good idea of the data set that you are dealing with. For example, in this particular data set, the dot plot shows us that a majority of the R&D percentages lie between 7 percent and 9 percent, and we can say that almost all of the R&D percentages fall between 6 percent and 12 percent. This was the discussion regarding the mode in the case of raw data pertaining to a continuous variable, and we have also discussed the dot plot. Students, you will be interested to note that the mode is a measure that can be computed even in the case of the nominal and ordinal levels of measurement. You will recall that the nominal scale is the one where we classify the observations into various categories in such a way that there is no particular order among the groupings. For example, when we talk about the marital status of an adult, we note that it can be classified into one of the following five mutually exclusive categories: single, married, divorced, separated and widowed, and there is no order among these categories as such. On the other hand, the ordinal scale is the one where a certain order does exist among the groupings. For example, speaking of human height, an adult can be regarded as tall, medium or short. You can see that if we express height in this way, there is an order. But since we have not expressed it in quantitative terms, we cannot say that we are dealing with an interval scale or a ratio scale.
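The counting rule for the mode, and the idea of piling up one dot per repeated value, can be sketched in Python as follows. This is a minimal sketch with hypothetical function names. The marks are the eight test marks quoted in the lecture, while the marital status responses are made up purely to show that the same counting works for nominal data. The text dot plot here is turned on its side, with values down the page and dots running across.

```python
from collections import Counter


def modes(observations):
    """Return every value that attains the highest count (ties are all returned)."""
    counts = Counter(observations)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]


def dot_plot_rows(observations):
    """One text row per distinct value, with one dot per occurrence."""
    counts = Counter(observations)
    return [f"{value:>4} | {'.' * counts[value]}" for value in sorted(counts)]


# Marks of the eight students quoted in the lecture.
marks = [2, 7, 9, 5, 8, 9, 10, 9]
print(modes(marks))
print("\n".join(dot_plot_rows(marks)))

# Hypothetical nominal data, invented for illustration only.
statuses = ["married", "single", "married", "widowed", "single", "married"]
print(modes(statuses))
```

Returning a list rather than a single value is a design choice that leaves room for data sets in which two or more values tie for the highest count.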
As I said earlier, the very interesting thing regarding the mode is that it can be computed even in the case of the nominal and ordinal levels of measurement. As an example of the determination of the mode for nominal level data, consider the following. A company has developed five different bath oils, and in order to determine consumer preference, the company conducts a market survey. The following chart shows the results of the market survey. In this attractive diagram, you note that the various bath oils 1, 2, 3, 4 and 5 have been taken along the horizontal axis. Obviously, the largest number of respondents favored bath oil number 2, as evidenced by the highest bar, and hence we can say that bath oil number 2 is the mode. So this is the way in which we can determine the mode in the case of nominal level data. Let us now consider the case when we are dealing with the frequency distribution, not the raw data but the frequency distribution, of a discrete variable. In the case of a discrete frequency distribution, identification of the mode is immediate. One simply finds the value which has the highest frequency. For example, suppose we have the data of an airline which recorded the number of passengers carried on each of 50 flights of a 40 seater plane. In the data which is in front of you, you can see at a glance that the highest frequency is 13, and hence within one second you can tell that the mode is 39, that is, 39 passengers. And since it is a 40 seater plane, the company should be quite satisfied that the 40 seater plane is just the right size for that particular route. This was the determination of the mode in the case of the frequency distribution of a discrete variable, and of course the next case is the mode of the frequency distribution of a continuous variable. In this particular case, students, our formula is going to be a bit longer than all that we have discussed until now.
In the case of grouped data, the first step is to find the modal group or modal class, that is, the class which contains the highest frequency. The next question is at what point within that class the mode lies, and for this there is a formula which we will apply to real life examples. In the case of the frequency distribution of a continuous variable, the mode is defined as $\hat{x} = L + \frac{f_m - f_1}{(f_m - f_1) + (f_m - f_2)} \times h$. I will define all the terms in this formula for you. $L$ represents the lower class boundary of the modal class, $f_m$ represents the frequency of the modal class, $f_1$ represents the frequency of the class preceding the modal class, $f_2$ is the frequency of the class following the modal class, and $h$ is the length of the class interval of the modal class. As you just saw, the notation that we have for the mode is $\hat{x}$, x hat. That is, we write x and put a little hat on top of it, and that is the notation for the mode. Now let me explain this formula that I have just presented with the help of the example that we have been considering for the past 2 or 3 lectures regarding the EPA mileage ratings of cars. As you will recall, the mileage ratings were in the classes 30.0 to 32.9, 33.0 to 35.9 and so on, and the numbers of cars in these classes were 2, 4, 14, 8 and 2. As I said a few minutes ago, the first step is to determine the modal class, that is, the class which contains the maximum frequency. As you can see on the screen, the class 36.0 to 38.9 is the one that has the maximum frequency, 14, and hence this very class is the modal class. This is the class within which the mode lies. The next question is what the exact value of the mode is. For that we will apply the formula that we have just seen, and to apply it we first need to determine $f_1$ and $f_2$.
As I told you, $f_1$ is the frequency of the class preceding the modal class, and so in this case it is 4. $f_2$ is the frequency of the class following the modal class, and in this case it is 8. $L$ is the lower boundary of the modal class, and hence as you can see in this case it is 35.95, and $h$ is the class interval of the modal class. In this case the class runs from 35.95 to 38.95, so obviously it is 3 units long and hence $h$ is equal to 3.0. Substituting all these values in the formula, we obtain the mode as 37.825. That is, in this example we can say that the mileage which occurs for the greatest number of cars is 37.825 miles per gallon. Let us now look at the mode with reference to the graphical representation of our data set. You will recall that for the EPA mileage ratings example we drew all three, the histogram, the frequency polygon and the frequency curve, and they were as you now see on the screen. Now if we would like to locate the mode on this diagram, of course it will be located on the x axis, because after all it is the most common value of the variable that we are dealing with, and that variable, miles per gallon, is measured along the x axis. As you can see, the mode is almost in the middle of the frequency distribution and it is directly below the highest point of the frequency polygon. So this is the point to be understood: in the case of the frequency distribution of a continuous variable, it is extremely easy to locate the mode if you draw the frequency curve of your data set. Your curve usually rises and then falls, and the x value directly below the maximum point is the mode. Obviously, this is in line with the definition that I have given you. After all, what was the mode? The most frequent value in your data set. And where will the frequency curve reach its peak? Obviously, it will peak at the most frequent value.
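The substitution carried out above for the EPA mileage example can be reproduced with a small Python sketch. The function names are hypothetical. The lower class boundaries beyond the first three classes are inferred from the stated class width of 3, and treating a missing neighbouring class as having frequency 0 is an assumption made here for completeness, not something stated in the lecture.

```python
def grouped_mode(lower_boundary, f_modal, f_prev, f_next, width):
    """Mode of grouped data: L + (fm - f1) / ((fm - f1) + (fm - f2)) * h."""
    numerator = f_modal - f_prev
    denominator = (f_modal - f_prev) + (f_modal - f_next)
    return lower_boundary + (numerator / denominator) * width


def mode_from_table(lower_boundaries, frequencies, width):
    """Find the modal class (highest frequency) and apply the formula to it."""
    i = frequencies.index(max(frequencies))
    # Assumption: a missing neighbouring class is treated as frequency 0.
    f_prev = frequencies[i - 1] if i > 0 else 0
    f_next = frequencies[i + 1] if i < len(frequencies) - 1 else 0
    return grouped_mode(lower_boundaries[i], frequencies[i], f_prev, f_next, width)


# EPA mileage example: classes 30.0-32.9, 33.0-35.9, 36.0-38.9 and so on.
lower_boundaries = [29.95, 32.95, 35.95, 38.95, 41.95]  # last two inferred
frequencies = [2, 4, 14, 8, 2]

mode = mode_from_table(lower_boundaries, frequencies, width=3.0)
print(round(mode, 3))  # 35.95 + (10 / 16) * 3
```

With L = 35.95, fm = 14, f1 = 4, f2 = 8 and h = 3, the fraction is 10/16, so the mode lies 1.875 units above the lower boundary of the modal class.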
Students, the topic that we have just discussed, the mode, is a very important and fundamental concept. We will, inshallah, discuss its real life applications next time, and I will also discuss with you the situations when we might have no mode in our data set or more than one mode, as in the bimodal situation. We will go into those details, inshallah, next time. But the point I want to emphasize at this time is that we have called the mode a measure of central tendency. The reason is that, as I just said, the mode is that value of x which is directly below the maximum point of your frequency curve, and in most real life data sets the maximum frequency occurs somewhere in the middle of the distribution. Hence the mode is somewhere in the middle, and hence it can be regarded as a measure of central tendency. In the next lecture, after discussing the real life applications of the mode and some variations of the situation, we will proceed to the discussion of the arithmetic mean and the weighted mean. In the meantime, I wish you the best in your studies of the subject, and Allah Hafiz.