Hello again, welcome back to the course lectures. Today we will be looking at the third example set. I hope by now you have a pen, paper and a calculator with you as you go through the example problems. In this lecture, we will be looking at data representation.

The first example involves box plot construction. You do not really have to construct the box plot manually; you can use standard software and find out how to create such plots. But even if you have no access to this software, you can easily construct the plots yourself. The example problem goes like this: the table given below shows the incidence of a deadly disease among children in 8 villages before and after vaccination. Draw a box plot and compare the quartiles, interquartile ranges and medians before and after the vaccination.

So you have the ID number of the village, 1 to 8, and then you have the incidence of the disease before vaccination. The disease could be polio; fortunately, it is more or less eradicated. Before vaccination, the incidence was 20, 38, 41, 50, 61, 70, 18 and 5 in the different villages. Different villages may have different levels of hospital facilities, hygiene and so on, so the number varies from one village to another. Unfortunately, in one village the incidence of the disease after vaccination has in fact gone up, from 50 to 52. In the other cases it has reduced, except one more case where it has more than doubled; in fact, it has increased 2.4 times. So these are the data with us, and we have to compare them with the help of a box plot.

The box plot was generated using Minitab, and it shows the data before vaccination and after vaccination. The line inside the box shows the median value; the median happens to equal the mean in this particular case. The bottom and top of the box are the first and third quartiles, and the line extending from the box is the whisker.
Here again you have a whisker, and there is another whisker here which you may not be able to see. The median has considerably reduced, from 39.5 to 14, and this point here, interestingly, is an outlier. We know that a whisker is drawn only up to the most extreme data point lying within 1.5 times the interquartile range from the box. So you do not have a whisker extending beyond this, and we have an outlier at this particular point. This obviously corresponds to the 52 which we saw in the previous table; this 52 is an unusually high number.

Rather than trying to figure out the numbers with respect to the scale, you can also have the software report the summary statistics. For the before-vaccination variable, the first quartile is 18.5, the median is 39.5 and the third quartile is 58.25. The interquartile range is the difference between the third quartile and the first quartile, and that comes to 39.75; the number of data points is 8. After vaccination, the first quartile is 11.25 (earlier it was 18.5), the median is 14 and the third quartile is 16.75. The interquartile range has reduced from 39.75 to 5.5. The whiskers are drawn from 7, the lower whisker, up to 17; you cannot see 17 clearly because it almost coincides with the third quartile, and you have the outlier at 52.

So what we can definitely say is that the incidence has considerably reduced: the median value has gone down from 39.5 to 14. Hence the vaccination was really effective. Also, earlier you had a large spread, with different villages having very different incidences of the disease, and after the vaccination the spread has considerably reduced.

Now what we have to do is look at the measures of central tendency and spread. You have 2 sets of data in ascending order, and you have to compare their mean values, medians and standard deviations. So you have data set 1 and data set 2. Let us look at the data.
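To make the box plot arithmetic concrete, here is a small Python sketch. The before-vaccination numbers come from the table; the after-vaccination list is hypothetical, chosen only so that it reproduces the quoted summary statistics (Q1 = 11.25, median = 14, Q3 = 16.75, outlier 52), since the transcript does not list those values. The quartile interpolation uses the (n + 1)-position convention, which matches the Minitab numbers quoted above.

```python
# Quartiles via the (n + 1) interpolation method (consistent with the
# Minitab values quoted in the lecture), plus the 1.5 * IQR outlier rule
# used to decide where the box-plot whiskers stop.
def quartile(sorted_x, p):
    """p-th quantile (0 < p < 1) using the 1-indexed position p * (n + 1)."""
    n = len(sorted_x)
    pos = p * (n + 1)
    lo = int(pos)              # integer part of the position
    frac = pos - lo            # fractional part for interpolation
    if lo < 1:
        return sorted_x[0]
    if lo >= n:
        return sorted_x[-1]
    return sorted_x[lo - 1] + frac * (sorted_x[lo] - sorted_x[lo - 1])

def box_summary(data):
    x = sorted(data)
    q1, med, q3 = (quartile(x, p) for p in (0.25, 0.5, 0.75))
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in x if v < lo_fence or v > hi_fence]
    return q1, med, q3, iqr, outliers

before = [20, 38, 41, 50, 61, 70, 18, 5]   # from the table
after = [7, 11, 12, 13, 15, 16, 17, 52]    # hypothetical, matches quoted stats

print(box_summary(before))  # Q1=18.5, median=39.5, Q3=58.25, IQR=39.75, no outliers
print(box_summary(after))   # Q1=11.25, median=14, Q3=16.75, IQR=5.5, outlier [52]
```

The drop in both the median (39.5 to 14) and the IQR (39.75 to 5.5) is exactly what the box plot shows visually.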
The data points here range from 22 to 38, so the range is 38 − 22, which is 16. Here the data values range from 5 to 55, so the range is 50. Immediately you can say that the second set has a broader or wider spread than the first. These may be marks in class 1 and marks in class 2; you can see that the marks in class 1 are more bunched together compared to class 2.

If you look at the average, first let us count the total. It is not too difficult: the running total goes 46, 72, 100, 132, 166, 202, and 202 plus 38 is 240. So the total is 240, and 240 divided by 8 gives an average of 30. Then if you count the second set: 15, 30, 55, 90, 135, 185, and 185 plus 55 is 240. So again you have 240, and when you divide 240 by 8 you get 30. So what we are seeing is that the 2 data sets have the same average. Even though you are taking the data from 2 different classes, the average values are the same: the mean value is 30 for both the classes.

The median value is also 30 for both the classes. How do you find the median? You arrange the data in ascending order, which has already been done for you, and you have an even number of data points. So you have 2m data points where 2m is equal to 8, so m is equal to 4, and you have to find the fourth and the fifth data points in the set. You take the average of these 2 numbers: 28 plus 32 is 60, and 60 divided by 2 is 30. So the median is 30, which matches the mean value. If you look at the second set, when you take the average of 25 and 35, the average is again 30, so the median is again 30. So both the data sets have the same mean and the same median.

The standard deviation, on the other hand, is 5.86 for data set 1, whereas it is 19.09 for the second data set. So the standard deviation here is more than 3 times that of the first data set.
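The mean and median comparison above can be verified in a few lines. The two lists here are reconstructed from the running totals read out in the lecture, so treat them as a plausible reconstruction rather than the official data; they do match every quoted statistic (range, sum, median, standard deviation).

```python
from statistics import mean, median

# Data sets reconstructed from the running totals in the lecture
set1 = [22, 24, 26, 28, 32, 34, 36, 38]   # range 38 - 22 = 16
set2 = [5, 10, 15, 25, 35, 45, 50, 55]    # range 55 - 5  = 50

# Both sets share the same centre despite very different spreads
print(mean(set1), mean(set2))      # 240/8 = 30 for both
print(median(set1), median(set2))  # (28+32)/2 = 30 and (25+35)/2 = 30
```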
So you can immediately conclude that there is a larger spread in the second data set. By now I hope you know how to calculate the standard deviation. To do that, you take each individual data value, let us say 5, and subtract the mean value from it: 5 − 30 is −25, and then you square that value, so 25 squared is 625. Similarly you take 10 − 30, which is −20; when you square it you get 400. 15 − 30 is −15; you square it and get 225. You add up all these numbers and divide by n − 1, which in this case is 8 − 1 = 7, and that gives you the variance. So you divide the sum of squares of the deviations from the mean by n − 1, where n is the number of data points, to get the variance, and you take the square root of the variance to get the standard deviation.

The standard deviation for the second data set is considerably higher than that for the first data set. So even if two data sets have the same mean and the same median, they may have different standard deviations: data of different spreads can have the same mean and median.

Let us go to the third example, where we discuss the frequency distribution. I have set up all these problems on my own; I hope there are no mistakes in any of them, and if there are, kindly send me your feedback. The lifetime of a special refill used in a ballpoint pen is monitored after providing a specimen each to people from different walks of life. The lifetimes, in days, of each of the ballpoint pens are recorded in the table given below. Draw a frequency histogram for this data. Is the given data following a normal distribution? So you are given a huge data set; there are 100 data points here. You have data of 47 days, 50 days, 52 days and so on, and we have to draw a frequency histogram of this data.
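The step-by-step recipe for the sample standard deviation can be written out directly. The data sets are again the reconstruction from the running totals quoted earlier.

```python
# Sample standard deviation worked out exactly as described: square each
# deviation from the mean, sum, divide by n - 1, then take the square root.
def sample_sd(data):
    n = len(data)
    m = sum(data) / n
    ss = sum((x - m) ** 2 for x in data)   # sum of squared deviations
    return (ss / (n - 1)) ** 0.5           # variance -> standard deviation

set1 = [22, 24, 26, 28, 32, 34, 36, 38]   # reconstructed data set 1
set2 = [5, 10, 15, 25, 35, 45, 50, 55]    # reconstructed data set 2

print(round(sample_sd(set1), 2))  # 5.86, as quoted in the lecture
print(round(sample_sd(set2), 2))  # 19.09, more than 3 times larger
```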
We will do some analysis of this data set to get a feel for it. The important attributes of the given data set are: the smallest value is 22 and the largest is 80, so the range is 80 − 22, which is 58. The mean value is 51.77 and the median is 52. How did you find the mean? You add up all the numbers in the collection and divide by 100. So the mean value is 51.77, and the median value of 52 is pretty close to the mean. Even more interestingly, the mode is also 52. The mode, by definition, is the number which appears most frequently, and that comes out to be 52.

So as far as this data set is concerned, the mean is almost equal to the median; you can approximate 51.77 to 52 without too much of a complaint, and the mode is also equal to the median. So all 3 parameters, mean, median and mode, are matching, and the standard deviation is 13.29.

What I have done here is put the data in ascending order, and if you look at the data you can see 52 appearing most frequently: 1, 2, 3, 4, 5 times. No other value appears 5 times here; 52 appears the most number of times in this data set, and so the mode is 52.

You can also show that the median is 52. You have an even number of data points, so 2m is equal to 100 and m is equal to 50, and you have to find the average of the 50th and the 51st data points. Each row here has 10 data points, so counting down the rows, 10, 20, 30, 40, 50, you locate the 50th and 51st values. The 50th data point is 52 and the 51st data point is also 52, so the average is very easy to calculate: the median is also 52. The mode is 52 and the mean was 51.77. Interesting; if you start looking at numbers from a statistical viewpoint, you can pick up a lot of interesting relations between the numbers and the different parameters associated with them.
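The mean, median and mode computations are all one-liners with Python's standard library. The full 100-point lifetime data set is not reproduced in the transcript, so the sample below is a small hypothetical list in the same spirit (values like 22, 47, 50, 52 and 80 appear in the lecture; the list itself is illustrative only).

```python
from statistics import mean, median, mode

# Hypothetical 8-point sample in the spirit of the lifetime data
lifetimes = [22, 47, 50, 52, 52, 52, 55, 80]

# Even number of points: median is the average of the two middle values
print(median(lifetimes))  # (52 + 52)/2 = 52
print(mode(lifetimes))    # most frequent value = 52
print(mean(lifetimes))    # 410/8 = 51.25, close to the median
```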
So we can generate a histogram; I have used Minitab version 16, and the histogram looks like this. You can see that I have taken 7 bins or cells. The smallest number was 22 and the largest number was 80, so I have actually started from 19.5 and gone up to 89.5, with 7 divisions of width 10 each.

How did I get the number 7? There are a few recommendations for the suggested number of bins. If you recollect the earlier lecture, we had Sturges' formula and also the number recommended by Montgomery and Runger: the square root of the number of observations. The square root of 100 is 10, so about 10 bins; Sturges' recommendation gives a slightly different value. I have chosen 7 bins here.

So between 19.5 and 29.5, how many data points do you have? Counting the values below 29.5 in the sorted list, there are 7 data points, so the first bar has a frequency of 7. Between 29.5 and 39.5, we do not include anything at or below 29.5, and every value up to 39 falls below 39.5; counting these gives 13. So you have 13 numbers between 29.5 and 39.5, and that is what is shown in the histogram: a frequency of 13 for the interval 29.5 to 39.5. You can count the numbers in each interval, and from these counts the bars are created and a histogram is formed. The curve is the normal distribution fitted to this distribution of data.

Suppose I had increased the number of cells from 7 to 13. There is obviously more detail, but somehow, at least subjectively to me, it does not look as good as the previous histogram. The 7-bin version looks compact and shows the trend, and the trend is approximately normal; here, with 13 bins, the data points look a bit more cluttered.
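The two bin-count recommendations mentioned above, and the bin edges used in the figure, can be computed directly:

```python
from math import ceil, log2, sqrt

n = 100  # number of observations

# Two common recommendations for the number of histogram bins:
sturges = ceil(1 + log2(n))   # Sturges' rule: 1 + log2(100) = 7.64 -> 8
root_n = round(sqrt(n))       # Montgomery & Runger's sqrt(n) rule -> 10

# Bin edges actually used in the lecture: 7 bins of width 10,
# offset by 0.5 so no integer data point falls on a boundary
edges = [19.5 + 10 * i for i in range(8)]   # 19.5, 29.5, ..., 89.5

print(sturges, root_n)
print(edges)
```

The half-unit offset (19.5 rather than 20) is a common trick so that integer-valued data never sits exactly on a class boundary.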
Well, it is subjective and it is up to you, but I do feel that 13 cells is far too many; even the number suggested by Montgomery and Runger was about 10. So you can see that with an adequate number of class intervals, 7, the trend is visible, while here it is slightly more cluttered.

The histogram is meant only for a qualitative interpretation of the distribution associated with the data. If we want to quantify it and argue more convincingly that the data have indeed come from a normal distribution, we can check by plotting the data on a normal probability plot and seeing whether our normality assumption is satisfied. Again I have made the normal probability plot using Minitab, and we will see how to calculate the x axis and y axis values.

There is nothing special in the abscissa calculation. As I told you earlier, you have to rank the data from the smallest to the largest, and the abscissa, also called the x axis, is simply the ranked raw data: the numbers ranked from smallest to largest, identified on the x axis.

The ordinate calculation is quite interesting, and there are several versions of the ordinate or y axis calculation. The most common one, at least from what I have seen, is (i − 0.5)/n, where i is the rank and n is the number of data points. Minitab uses the median rank method of Benard, which is a different formula: (i − 0.3)/(n + 0.4) × 100. So if you follow Minitab's recommendation, (rank − 0.3) × 100/(n + 0.4) is plotted on the y axis against the ranked data value. The smallest value, if you recollect, was 22 and the largest value was 80, so you are plotting 22 here and 80 here, with the y axis based on the rank. For the smallest data point the rank is 1, and 1 − 0.3 is 0.7, so you have 70 divided by 100.4, which you can take as approximately 0.7 percent.
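The two ordinate formulas can be sketched as small functions; for n = 100 the Benard positions reproduce the endpoint values worked out above.

```python
# Two common plotting-position formulas for a normal probability plot,
# for a data point of rank i out of n:
def midpoint_position(i, n):
    return (i - 0.5) / n             # the common (i - 0.5)/n rule

def benard_position(i, n):
    return (i - 0.3) / (n + 0.4)     # Benard's median rank (used by Minitab)

n = 100
# Smallest point (rank 1): (1 - 0.3)/100.4 = 0.697 %, roughly 0.7 %
print(round(100 * benard_position(1, n), 3))
# Largest point (rank 100): 99.7/100.4 = 99.303 %, close to 100 %
print(round(100 * benard_position(100, n), 3))
```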
Remember that the percent scale here starts from below 1 and goes up towards 100. So the 0.7 would be somewhere here, and that is your first data point; and here is the 100th data point. For the rank of 100, the position is pretty much close to 100 percent, because 100 − 0.3 is 99.7 and 100 + 0.4 is 100.4, and 99.7 divided by 100.4 is pretty much close to 1, so you have close to 99.3 percent. That is how you calculate the y axis value corresponding to each x axis value.

The important thing is that the data points are pretty much falling on a nice straight line, and so you can convince your skeptics that the data indeed came from a normal distribution. Well, this is hardly surprising, because the original data I chose was generated using the random data option of the normal distribution: it randomly picked values from a normal distribution and gave them to me, and then I did the other calculations like sorting into ascending order, histogram drawing and so on. So it is hardly surprising that the data obeys the normal distribution.

But in your experimental work you may get data from your equipment or instruments, and you may have a model. The discrepancy between the experimental observation and your model defines the residual. The common assumption is that you have explained all the controllable factors using the model, and the difference between the experiment and the model prediction may be attributed to noise or random error; the usual assumption is that this random error is normally distributed. So you can calculate the difference between your experimental response and the model prediction to get the residuals, rank the residuals in ascending order, plot them on the normal probability plot, and see whether the points lie on a straight line or nearly a straight line.
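The residual-checking procedure described above can be sketched as follows. Everything here is hypothetical (made-up observations and predictions, standing in for your experiment and your fitted model); the point is only the mechanics: residual = observed − predicted, rank the residuals, and pair each with a normal score.

```python
from statistics import NormalDist

# Hypothetical experimental responses and model predictions
observed  = [10.2, 11.9, 14.1, 16.0, 17.8, 20.3]
predicted = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]

# Residuals, ranked in ascending order
residuals = sorted(o - p for o, p in zip(observed, predicted))

n = len(residuals)
# Normal score for rank i: z value at cumulative probability (i - 0.5)/n
scores = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

# Plotting residuals (x) against scores (y) should give a near-straight
# line if the errors-are-normal assumption holds.
for r, z in zip(residuals, scores):
    print(f"{r:+.2f}  {z:+.3f}")
```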
You cannot get all the data points lying on a perfect straight line, but if they are more or less on a straight line you can say that the assumption that the errors are normally distributed is justified.

When you look at this particular graph, the data points are again plotted from 22 to 80; 22 is here and 80 is here, and the other data points are in between. How did you find the y axis value here? For the first point the value is −2.57583, a rather frightening number. How did you get this number? The rank is of course 1, so you have 1 − 0.5, which is 0.5, and 0.5 divided by 100 is 0.005. So what is the z value which gives a cumulative probability of 0.005? That is what you have to check, and that small probability corresponds to the area on the left hand side of the normal distribution curve; it corresponds to the left tail. So let us see what z value gives a probability of 0.005. Going to the normal distribution table, −2.57 gives a probability of 0.0051 and −2.58 gives a probability of 0.00494, so the required z value lies somewhere in between −2.57 and −2.58; since it is done by the software, you get the more accurate value −2.57583.

Similarly, for the next data point you can find the rank; the rank is obviously 2, 2 − 0.5 is 1.5, and 1.5 divided by 100 is 0.015. The z value corresponding to a probability of 0.015 may be found from the table; it is close to −2.17. So you can plot all these scores, based on the inverse of the cumulative distribution function, and you will find that the data points lie pretty much on a straight line.

Now we are going to do something interesting. The class intervals were divided from 19.5 to 29.5 and so on.
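Instead of interpolating in a printed table, the inverse of the cumulative distribution function is available in Python's standard library via `statistics.NormalDist`, so the z values above can be reproduced directly:

```python
from statistics import NormalDist

n = 100
std_normal = NormalDist()   # mean 0, standard deviation 1

# z value whose left-tail (cumulative) probability is (i - 0.5)/n
z1 = std_normal.inv_cdf((1 - 0.5) / n)   # rank 1: p = 0.005
z2 = std_normal.inv_cdf((2 - 0.5) / n)   # rank 2: p = 0.015

print(round(z1, 4))  # -2.5758, the "frightening number" on the plot
print(round(z2, 2))  # -2.17
```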
We had 7 such classes, and we find the number of occurrences of the data points in each class interval: 7, 13, 20, 29, 25, 5 and 1. So between 49.5 and 59.5 you had the maximum number of occurrences, 29. The relative frequency, or the probability, is obtained by dividing each frequency by 100, giving 0.07, 0.13, 0.2, 0.29, 0.25, 0.05 and 0.01.

So how do you calculate the z values? I am going to report 2 z values, z1 and z2: z1 corresponds to the lower limit of the interval and z2 corresponds to the upper limit of the interval. You can see that the upper limit of one interval becomes the lower limit of the next: the upper limit of this interval is −1.693 for z, and that became the lower limit for the next interval.

But the main question is how to find z1 and z2. These are raw scores, even though they are apparently coming from a normal distribution; well, we have confirmed that they are indeed coming from a normal distribution, because the normal probability plot shows the linear trend. We have to convert them to the standard normal form, and for that we use the mean and the standard deviation calculated from the data: z = (x − μ)/σ. I have used 52 here; actually you should use the value of 51.77, which is more accurate, together with the standard deviation. So for this boundary you put 29.5 − 52 and divide by 13.29, and you get the value −1.69. Then you have these 2 z values, and you have to find the probability that z will lie between −2.445 and −1.693 on the standard normal curve.
So you find the probability of the z value lying below −1.693, and also the probability of z lying below −2.445, and subtract the 2 probabilities to get 0.038. You multiply 0.038 by the total number, which is 100, and you get 3.8, which you can round to approximately 4. This value of 4 compares reasonably with the observed count of 7.

When you do the same thing for the next class interval, z1 and z2 are −1.693 and −0.941. The probability of the z value lying between −1.693 and −0.941 on the standard normal curve is 0.128; when you multiply 0.128 by 100 you get 12.8, which is approximately 13, and the actual number of observations between 29.5 and 39.5 was 13, so the 2 numbers match reasonably well. You can see that the other numbers also match reasonably well, and so we can say that the present set of numbers is distributed normally. These are the standard normal cumulative distribution charts; you can refer to them, take the values from the book, or use a spreadsheet or statistical software like Minitab to get the probabilities.

This concludes our presentation. We have been doing a few problems on the normal and log normal distributions, and we also showed some typical problems involving exploratory data analysis. We saw the histogram, the normal probability plot and the box plots. They are very useful for concise presentation of data, where you can express a lot of conclusions in a single plot. If you show 3 or 4 diagrams, explaining the trends with all of them would be quite difficult; if instead you can summarize the data in a compact form, for example the box plot, your conclusions and presentations will be more effective. Many of these calculations do not really require statistical software.
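The expected-versus-observed comparison for all 7 class intervals can be sketched as below. Following the lecture, the mean is taken as 52 (51.77 is the more accurate value) and the standard deviation as 13.29; the observed counts are the histogram frequencies quoted above.

```python
from statistics import NormalDist

mu, sd, n = 52, 13.29, 100                 # lecture uses 52; 51.77 is more exact
edges = [19.5 + 10 * i for i in range(8)]  # class boundaries 19.5 ... 89.5
observed = [7, 13, 20, 29, 25, 5, 1]       # counts from the histogram

dist = NormalDist(mu, sd)
for lo, hi, obs in zip(edges, edges[1:], observed):
    z1, z2 = (lo - mu) / sd, (hi - mu) / sd   # standardize both limits
    p = dist.cdf(hi) - dist.cdf(lo)           # P(lo < X < hi)
    print(f"z=({z1:+.3f}, {z2:+.3f})  expected={n * p:5.1f}  observed={obs}")
```

For the first interval this gives z = (−2.445, −1.693) and an expected count of about 3.8, matching the hand calculation above.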
So even if you do not have access to such software, then with a pen, pencil, paper, the standard normal probability charts and a calculator, you can do pretty much all the calculations and present the graphs. This concludes the continuous probability distributions and the data representation. We will now slowly move on to the next aspect of our statistical analysis. Thank you.