Continuing the discussion on variance, suppose we want to find the variance of a population, a population of huge size. Assume that the mean of the population is already known to us; then we find the population variance using a similar kind of formula: sigma squared = (1/n) * sum from i = 1 to n of (x_i - mu)^2. Since mu was already known and was not computed from the x_i, we are able to use n in the denominator. In the case of the sample variance, we use the sample itself to find the sample mean, and then use the sample mean to find the sample variance; the n deviations are therefore not independent and we are forced to use n - 1. There is another reason why we use n - 1, and the two reasons are discussed by Montgomery and Runger. If you look at the sample mean, or arithmetic mean, it is known that the sample mean lies closer to the sample values than the population mean does. Since you are calculating the arithmetic mean from the sample itself, it is going to lie closer to the sample values than the population mean does. This leads to smaller deviations, and a smaller sum of squared deviations, than the actual case. To compensate for this closeness, n - 1 is preferred: if you use the sample mean, you get apparently less scatter because the sample mean is closer to the data values, and using n - 1 instead of n in the denominator provides a partial compensation. Using n - 1 rather than n is termed Bessel's correction. So far, we have been talking about data sets, which were also termed samples, and their features such as the sample mean, median and mode. When these data sets are large and contain repetitive data, they may be better organized and a frequency distribution may be created. You might have done this in class 9 and class 10, where you created frequency tables.
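The n versus n - 1 distinction above can be sketched in a few lines of Python. The data values and the assumed known population mean below are purely illustrative numbers, not from the lecture:

```python
# Illustrative data; mu_known plays the role of a population mean that is
# known in advance, independently of this sample.
data = [4.2, 4.8, 5.1, 3.9, 4.6]
mu_known = 4.5

n = len(data)
xbar = sum(data) / n          # sample mean, computed from the data itself

# Population-style variance: deviations about the known mu, divide by n.
var_pop = sum((x - mu_known) ** 2 for x in data) / n

# Sample variance: deviations about xbar, divide by n - 1 (Bessel's
# correction), because xbar lies closer to the data than mu does,
# which shrinks the raw sum of squared deviations.
var_sample = sum((x - xbar) ** 2 for x in data) / (n - 1)

print(var_pop, var_sample)
```

The deviations about xbar sum to a smaller quantity than the deviations about mu, and dividing by n - 1 partially compensates for that.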
This is the forerunner to the probability distribution. Both discrete and continuous probability distributions have a mean, median and mode, and we will see how to calculate them for these distributions. In the discrete case, the probability distribution function assigns probability values to the individual values the random variable may take in the sample space. When the random variable is continuous, we have a continuous probability density function. The median x_m for a discrete probability distribution is the value, within the range of allowed values the random variable X may take, at which the cumulative distribution equals 0.5; in other words, F(x_m) = 0.5. On the same lines, we can define the median for a continuous probability density function: the integral from -infinity to x_m of f(x) dx, which is F(x_m), equals 0.5. It can be seen that just as the integral from -infinity to x_m of f(x) dx equals 0.5, the integral from x_m to infinity of f(x) dx also equals 0.5. You locate x_m in such a way that both integrals have the value 0.5. The median is the second quartile. We may also define a quantile for a probability density function quite easily; note the spelling, quartile with an r, quantile with an n. By definition, the pth quantile x_p of a continuous distribution is given by the integral from -infinity to x_p of f(x) dx, which is the cumulative distribution function at x_p, being equal to p. We also use the term percentile: the percentile is the quantile value times 100. So the 0.25 quantile is the 25th percentile, which is equal to the first quartile. We have already seen the mode of a probability distribution; in the case of a discrete distribution, the mode is the value taken by the random variable X for which the probability is a maximum.
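The median and quantile definitions above can be checked numerically with Python's standard library. The normal distribution with mean 10 and standard deviation 2 used here is just an illustrative choice:

```python
from statistics import NormalDist

# A continuous distribution for illustration: normal with mu = 10, sigma = 2.
dist = NormalDist(mu=10, sigma=2)

x_m = dist.inv_cdf(0.5)     # median: the x where F(x_m) = 0.5
q1  = dist.inv_cdf(0.25)    # 0.25 quantile = 25th percentile = first quartile
p90 = dist.inv_cdf(0.90)    # 0.90 quantile = 90th percentile

# Both tails about the median carry probability 0.5 each.
print(dist.cdf(x_m), 1 - dist.cdf(x_m))
```

For a symmetric distribution such as this one, the median coincides with the mean, so `x_m` comes out as 10.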
In the case of a continuous probability density function, the mode is the value taken by the random variable X at which the probability density function attains a maximum. So by the definition of a maximum point of the probability density function f(x), we have df(x)/dx = 0 with the second derivative d^2 f(x)/dx^2 < 0. This is the criterion for the mode. It is possible that, depending on where we started the analysis, the maximum we have detected is only a local maximum; the distribution may have several peaks and we may have identified only one of them. If a probability density function has a unique maximum, it is said to be unimodal. If the distribution has more than one peak, it is said to be multimodal; the special case of exactly two peaks is referred to as a bimodal distribution. One example where you may encounter multiple modes is in particle size distribution diagrams, where you may get two or three peaks: the first peak may correspond to the fines and the second peak to the coarser particles. Now let us come to the next mode of representation of data. We have seen data represented in the form of box plots and scatter plots; now we want to represent data in the form of histograms. The histogram is a visual representation of the frequency distribution. Histograms are suited to data that are continuous in nature and voluminous; the particle size distribution output from a particle size measuring device, for example, is quite voluminous, and we often represent such data as histograms. A histogram compiles and combines sections of the data before presentation, so there is some loss of information: you have to step back a bit to look at the overall picture, and in doing so you may miss some of the finer points.
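The local-maximum criterion and the bimodal particle-size example can be sketched numerically. The mixture density below is entirely hypothetical (two normal components standing in for a fines peak and a coarse peak), and a simple grid search stands in for solving df/dx = 0 with d^2 f/dx^2 < 0:

```python
from statistics import NormalDist

# Hypothetical bimodal density: a mixture of a "fines" peak and a
# "coarse" peak, as in a particle size distribution.
fines  = NormalDist(mu=5,  sigma=1)
coarse = NormalDist(mu=20, sigma=3)

def f(x):
    # Mixture density with illustrative weights 0.4 and 0.6.
    return 0.4 * fines.pdf(x) + 0.6 * coarse.pdf(x)

# Grid search for local maxima: a grid point is a local maximum if the
# density there exceeds the density at both neighbouring grid points.
xs = [i * 0.01 for i in range(3000)]            # 0.00 .. 29.99
modes = [xs[i] for i in range(1, len(xs) - 1)
         if f(xs[i]) > f(xs[i - 1]) and f(xs[i]) > f(xs[i + 1])]
print(modes)
```

With well-separated components like these, the search recovers two modes near 5 and 20, confirming the distribution is bimodal.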
Similarly, when you generate histograms you give up some individual data values: rather than looking at the individual data, which becomes cumbersome and tedious when the points are too many in number, we inspect the frequency of occurrence of the data. What we then have is the frequency distribution, and dividing each frequency by the total number of observations leads to the probability distribution. The procedure for constructing histograms is given in several books; I am following DeCoursey (2003) here. The data is divided into an appropriate number of intervals or classes, and the frequency of occurrence of the data within each interval is calculated. The class intervals are also called bins or cells. Suppose you have a distribution of marks varying from 0 to 100 and you want a histogram of the class performance rather than the individual marks. You may create divisions, bins or cells of size 10, and for every 10 marks find the number of students falling in that particular category. You are then not looking at individual students' marks but at the number of students whose marks fall within a certain interval; for the interval 40 to 50, say, about 30 students in the class may have scored between 40 and 50. How many bins or intervals should you use? A general recommendation is between 7 and 20, and these intervals are usually equally spaced. There may be situations where you want to go in for unequally sized intervals; Montgomery and Runger discuss what should be done in such a case. We will continue with the discussion corresponding to intervals of equal size. If you use very few intervals, there will be a considerable loss of information; if you use too many cells or bins, you provide too much detail and it becomes difficult to pick up the overall trend.
Histograms are stable for larger data sets, which means that their appearance does not change drastically with a change in the bin width. So you want to apply histograms to data sets having typically 75 or more observations. To repeat and emphasize the point: for large data sets the histogram is a good indicator of the probability distribution best describing them. From the histogram you can see whether the distribution is unimodal or multimodal, and whether it is symmetric or skewed. So this is the histogram. You can see that bins or cells have been created, each about 5 hours in width, with lifetime in hours on one axis and frequency on the other. Between 19.5 and 24.5 there are 2 occurrences, between 24.5 and 29.5 there are 6 occurrences, and so on; something like 13 bins have been used here. This is the shape of the distribution, and in this case the normal distribution best fits the given data. How do you find the number of intervals or classes? Some guidelines or suggestions have been provided. It is not mandatory that you strictly adhere to them; they are just guidelines, and you may tweak the suggested number a little to make the appearance of the histogram better from your point of view. The number of intervals or classes is denoted N_i and is given by N_i = 1 + 3.3 log10(n). Montgomery and Runger have another rule of thumb: they say the number of intervals should be the square root of n, where n is the total number of data points. So in the recommendation by Montgomery and Runger, the square root of the number of observations is taken as the estimated or suggested number of class intervals. Sturges' rule is N_i = 1 + 3.3 log10(n), as we saw in the previous slide, and the class interval size is estimated by dividing the range by N_i.
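The two bin-count guidelines above are one-liners. The helper names and the example n = 100 are purely illustrative:

```python
import math

def sturges_bins(n):
    # Sturges' rule: N_i = 1 + 3.3 * log10(n), rounded to an integer.
    return round(1 + 3.3 * math.log10(n))

def sqrt_bins(n):
    # Montgomery and Runger's rule of thumb: N_i = sqrt(n).
    return round(math.sqrt(n))

print(sturges_bins(100), sqrt_bins(100))   # 8 and 10 for n = 100
```

Both land inside the general recommendation of 7 to 20 bins for moderate data sets, and either result may be tweaked a little for a better-looking histogram.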
This is for the number of bins or cells; once you have the range, you divide the range by N_i to get the class interval size, which is the width of each column in the histogram. Of course, the class boundaries should be without overlaps or gaps. We can see that this histogram is pretty close to normal: we do not see a great deal of asymmetry in the spread of the data, nor very distinct multiple peaks; we see more or less a single peak in this distribution. Of course, the interpretation of the histogram in terms of number of peaks and symmetry is a bit subjective, and it is expected to be so because it is a visual inspection. When you are generating histograms, take the lower limit of the smallest class or interval to be slightly lower than the smallest data value and the upper limit of the largest class to be slightly higher than the largest value. Suppose you have data points ranging from 1 to 100. When you define the range of the histogram, the lower limit should be slightly lower than the smallest data point, 1, so you may take 0.5; similarly, the upper limit of the largest interval should be slightly higher than the largest number, so even though the largest number is 100, you may take 100.5. The first interval will then be drawn using 0.5 as its lower limit, and the last interval will end at 100.5. Now we come to another aspect. We plotted the data in the form of a histogram, and then a normal, bell-shaped curve was drawn to see whether the distribution was normal. We can do even better by using probability charts. In many cases, your experimental modeling may be based on certain assumptions; you may assume that the errors in the experiments are random and normally distributed with zero mean and constant variance.
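The boundary construction just described can be sketched directly. The marks data below is hypothetical; the boundaries start 0.5 below the smallest possible mark and end 0.5 above the largest, so no data point falls exactly on a class boundary:

```python
# Hypothetical marks data in the range 0-100.
marks = [12, 27, 33, 38, 41, 44, 47, 52, 55, 58, 63, 67, 71, 78, 84, 91]

# Class boundaries: 0.5, 10.5, 20.5, ..., 100.5 (width 10, no gaps, no overlaps).
lo, width, nbins = 0.5, 10, 10
edges = [lo + k * width for k in range(nbins + 1)]

# Frequency of occurrence within each class interval.
freq = []
for left, right in zip(edges[:-1], edges[1:]):
    freq.append(sum(1 for m in marks if left <= m < right))

print(list(zip(edges[:-1], freq)))
```

Dividing each entry of `freq` by `len(marks)` would turn this frequency distribution into the corresponding probability distribution.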
This is a usual assumption, so you want to check whether the errors are distributed normally, and for this reason we use probability distribution charts. When we want to find the shape or form of the underlying distribution, we have a speculation based either on experience or on other workers' results, and we want to test whether our data also follows that particular distribution. Before you plot the data, you have to do a bit of pretreatment; the full details about the motivation for the pretreatments are given by Organik. What you do first is arrange the data points in ascending order. The data may have been obtained from an experiment, and the experiments may have been performed in a random sequence: rather than doing the experiment with the lowest setting first, then moving on to the medium settings and then the highest setting, you might have mixed the sequence of runs so that external influences are spread evenly over all the settings. This is called randomization. But once you have obtained the responses from your experiments, you arrange them in ascending order. Now suppose you conduct a series of experiments, note down the data and arrange it in ascending order, and let us say you performed these experiments between June and July. Out of curiosity, you perform the same set of experiments again between July and August. There is no guarantee that the data you generated between June and July will be identical to the data generated between July and August; so many random factors may have influenced the outcome of your experiments, so the data may not match. However, if the data follow a particular distribution in June to July, you would more or less get a similar kind of distribution for the experiments conducted between July and August.
So the distribution of the data would be more similar than the data points themselves. How do we go ahead and identify the trend of the distribution? You do not have to test only for normal distributions; there are other important statistical distributions, and probability distribution charts are available for those as well. Some of the more common ones are the normal distribution, the log normal distribution, the Weibull, the chi squared and the gamma. Once you plot the data in a suitable graphical form and show that the data indeed follows the distribution, there is no ambiguity. It is a much better approach than simply stating the assumption that the errors are normally distributed, which remains a speculation if it is not backed by proper evidence. Things have now become more advanced: you do not even need a pencil and a probability chart to do the so-called plotting procedure, since several software packages are available with which the data can be plotted and represented very easily. So once you have the data, you arrange the data points in ascending order: the smallest data point comes first and the largest goes to the end. Having arranged the data in ascending order, you rank the data. The lowest data point, which is first in the list, is given the first rank. This is somewhat different from the classroom case, where the student with the highest marks is given the first rank; here the lowest data point in the lot gets rank number 1. So the data are arranged such that the smallest data point has the first rank and the largest has the last rank. Now, there are several ways to plot the data: the ordered observations are plotted against the so-called observed cumulative frequency. The formula is (i - k)/(n - 2k + 1), where k is the parameter in the observed cumulative frequency function.
So if you put k = 0.5, this formula reduces to (i - 0.5)/n, where i is the rank and n is the number of data points in the set. So you are going to plot the observed cumulative frequency against the ordered observations; you can plot (i - 0.5)/n versus the ordered observations directly on the appropriate probability paper. If you do not have the probability paper, you have to do a bit more calculation. We are essentially creating percentiles, and these percentiles are equidistant. There is another formula for finding the percentiles, given by i/(n + 1); so the first formula we use is (i - 0.5)/n, and then there is i/(n + 1). There are several versions for generating these percentiles. What is meant by a percentile? Please note that the (100 m)th percentile in a data set indicates the value P such that (100 m) percent of the data are below or equal to P, and ((1 - m) times 100) percent of the data are above it. So we calculate the percentiles. Now the idea is to see whether the values corresponding to the percentiles according to the normal distribution correlate with the actual data values, that is, whether they are similarly distributed with respect to each other. Essentially, we are going to compare with the normal distribution percentiles. How to do that? We will see. Take (i - 0.5)/n as a probability value. Then find the corresponding z value from the normal probability distribution. Usually you are given the value of z and you find the probability; here we are doing it in a slightly different manner, giving the value of the probability and finding the value of z. So this is the inverse case: we have to find the inverse of the cumulative distribution function at (i - 0.5)/n.
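This inverse step is exactly what `NormalDist.inv_cdf` does in Python's standard library. The choice n = 10 and the ranks shown are illustrative:

```python
from statistics import NormalDist

std = NormalDist()            # standard normal: mean 0, standard deviation 1

n = 10
for i in (1, 5, 10):          # a few example ranks
    p = (i - 0.5) / n         # the plotting position, taken as a probability
    z = std.inv_cdf(p)        # the inverse problem: probability -> z
    print(i, p, round(z, 3))
```

For rank 1 the position is 0.05 and z is about -1.645; for rank 10 the position is 0.95 and z is about +1.645, by the symmetry of the normal distribution.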
F represents, of course, the cumulative distribution function of the normal distribution. So, to repeat: we arrange the actual raw data in ascending order and rank the data. The smallest data point has the first rank, i = 1, and the largest data point has the highest rank. Then, for each of the i values, you compute (i - 0.5)/n; that gives a number. Once you have done that, you find the z value from the standard normal probability chart. Once you have the z values, plot them as ordinate against the ordered values as abscissa. So the z values, obtained by applying the inverse of the cumulative distribution function to (i - 0.5)/n, are plotted against the ordered values on the abscissa. Here we can use a regular graph sheet. If the assumed distribution adequately describes the data, the points will fall approximately on a straight line; if the points deviate significantly from linearity, the hypothesized model is not adequate. So you want the data points aligned more or less on a straight line. You will probably not get a perfect straight line, but a general overall linear trend is acceptable. Obtaining the z values for each (i - 0.5)/n using the normal probability tables is not the only way; we can also use normal probability paper. When you find the z values using the probability tables, you can go ahead and use regular graph paper. But if you already have normal probability paper with you, you do not have to identify the z value for each (i - 0.5)/n; you directly plot the (i - 0.5)/n values on the normal probability graph paper. What is so special about this graph paper? On it, the scale of the ordinate is adjusted so that it directly represents the z corresponding to the probability (i - 0.5)/n. You can consider a simple analogous situation.
When you plot y versus x on a log-log sheet, you directly plot the values of x and y on the abscissa and ordinate respectively. For example, suppose you want to test the model y = m x^n. You can take log y and log x, do these mathematical calculations yourself, and then plot the transformed values on regular graph paper. But if you have a log-log sheet, you plot y versus x directly: on the y axis you locate the value of y, and on the x axis the value of x. You do not have to take the log of y and then locate that value on a linear scale; you plot y and x on the logarithmic scales. When you do that, you will find that if your assumed model is correct, the data points fall on a straight line with slope n and intercept m, read off at x = 1. Remember, the intercept read off the y scale is m directly, not log m; that is an important thing to note. So on the log-log sheet, the scales are automatically adjusted in length on a logarithmic basis. Similarly, on the normal probability chart, when you plot the ordered x values against (i - 0.5)/n, you directly use the (i - 0.5)/n and x values; you do not have to find the z values when you have the normal probability chart with you. So on the special normal probability plot, you directly place your (i - 0.5)/n values as ordinates against each corresponding x_i value on the abscissa and check for linearity. Here the x_i values are the actual data values arranged in ascending order. Why did we choose (i - 0.5)/n and not i/n?
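The log-log analogy can be verified with a quick calculation. The model parameters m = 2.5 and n = 1.7 and the x values below are hypothetical, and the data is generated noise-free so the fit is exact:

```python
import math

# Hypothetical model y = m * x**n with illustrative parameters.
m_true, n_true = 2.5, 1.7
xs = [1, 2, 4, 8, 16]
ys = [m_true * x ** n_true for x in xs]       # exact, noise-free data

# In log space the model is: log y = n * log x + log m, a straight line.
# Slope from the first and last points (any pair works for exact data):
slope = (math.log10(ys[-1]) - math.log10(ys[0])) \
        / (math.log10(xs[-1]) - math.log10(xs[0]))

# Intercept at x = 1 (log x = 0): reading the y scale there gives m itself.
intercept = 10 ** (math.log10(ys[0]) - slope * math.log10(xs[0]))

print(slope, intercept)    # recovers n = 1.7 and m = 2.5
```

The intercept computed in log space and exponentiated back gives m directly, which is exactly what reading the y scale of a log-log sheet at x = 1 does.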
Our aim was to create equal-sized intervals, and (i - 0.5)/n is a popular choice because, had we chosen i/n, it could lead to an infinite value of z: the probability that z is less than infinity is 1, so if you try to find the z whose cumulative probability is exactly 1, that z becomes undefined. You are inverting the cumulative distribution function, asking what value of z gives the required probability; if the required probability for the last data point is 1, the identified value of z is unbounded. So you run into this problem if you use i/n to create the equal-sized intervals, and that is why we instead use (i - 0.5)/n. Let us take a simple example. Say you have the marks 20, 30, 37, 44, 50, 60, 61, 70. The marks are already arranged in ascending order, so the ranks are also given: 1, 2, 3 and so on. You can calculate i/(n + 1); you can also calculate i/n; and you can calculate (i - 0.5)/n. Equal-sized intervals have been created: 0.05 to 0.15, an interval size of 0.1, then again 0.1, and so on. So you have different intervals depending on which formula you have chosen. Now you find the z value: what value of z gives a probability of 0.091? It is an inverse problem. Given 0.091, corresponding to the probability or area under the curve, the z value is -1.335; that is, a z value of -1.335 corresponds to a probability of 0.091 in the normal distribution diagram. Similarly, you can find the z values for all the other elements in the column given here. For (i - 0.5)/n you have 0.05, 0.15 and so on up to 0.75, and corresponding to these probabilities you find the z values by taking the inverse of the cumulative distribution function. Similarly, for i/n you have values ranging from 0.1, 0.2, 0.3 and so on up to 0.8.
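The inverse lookup quoted above can be checked directly: the z value whose cumulative probability is 0.091 comes out close to -1.335, and the forward direction recovers the probability.

```python
from statistics import NormalDist

std = NormalDist()            # standard normal

z = std.inv_cdf(0.091)        # inverse problem: probability -> z
print(round(z, 3))            # -1.335

print(round(std.cdf(z), 3))   # forward direction recovers 0.091
```

The same call gives the z value for each of the other probability columns (0.05, 0.15, ..., or 0.1, 0.2, ...) in the example table.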
You can also find the z values. So you can now plot, on rectangular (that is, regular) graph paper, z versus the marks and check whether they have a linear relationship. In this example, take raw data values given as 176, 191, 214 and so on to 185. These are not ranked, so you arrange them in ascending order and assign the ranks 1, 2, 3 and so on up to 10, then use the formula (i - 0.5)/n. You have 10 data points, so n = 10; for rank i = 1, (1 - 0.5)/10 = 0.05, and for rank i = 2, (2 - 0.5)/10 = 0.15. Similarly, you can calculate the values for the other ranks, and then find the z value corresponding to each of these probabilities. The value -1.645 for an area under the curve, or probability, of 0.05 is quite a famous one, familiar to everyone who uses the normal chart extensively; similarly, +1.645 corresponds to an area under the curve of 0.95, because of the symmetry of the normal distribution. So you plot the z value against the ordered mark and see whether you get a straight line. One more thing I would like to point out: why we should preferably use (i - 0.5)/n and not i/n. Had you used i/n, the last mark would have had rank 10, and 10/10 would give an i/n value of 1, for which the z value is undefined: what value of z gives a probability of 1? The answer is infinity. Just as a probability of 0 has a z value of minus infinity, a z value of plus infinity corresponds to a probability of 1, the one containing the 100th percentile of the data. Since you have this difficulty, you use (i - 0.5)/n instead, with which the last entry becomes only 0.95, and for that you can easily find the z value.
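The whole procedure can be sketched end to end on a hypothetical 10-point data set (the values below are made up for illustration, not the lecture's actual data), with a Pearson correlation coefficient standing in numerically for the visual straight-line check:

```python
from statistics import NormalDist

# Hypothetical raw data, in the order the experiments produced it.
data = [214, 183, 205, 191, 176, 198, 188, 201, 195, 209]

ordered = sorted(data)          # ascending order: smallest value gets rank 1
n = len(ordered)

# z value for each rank i via the plotting position (i - 0.5)/n.
z = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
# z[0] is inv_cdf(0.05), about -1.645; z[-1] is inv_cdf(0.95), about +1.645.

# Pearson correlation between the ordered data and the z values:
# near 1 suggests the points fall close to a straight line.
xbar = sum(ordered) / n
zbar = sum(z) / n
num = sum((x - xbar) * (t - zbar) for x, t in zip(ordered, z))
den = (sum((x - xbar) ** 2 for x in ordered)
       * sum((t - zbar) ** 2 for t in z)) ** 0.5
r = num / den
print(round(r, 3))
```

Plotting `z` against `ordered` would reproduce the graphical check; the correlation coefficient simply quantifies how close the points come to a straight line.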
So, to summarize: plot the value of z against the marks, and if the points fall on a straight line, you can conclude that the data follows the normal distribution. The given data, when plotted with z on the y-axis and x_i on the x-axis, shows a linear trend, so we can safely assume that the data are distributed normally. To summarize the lecture, we have looked at a few important characteristics of distributions and the presentation of data. We looked at the mean and the median, and we compared them to see which is the more robust estimator of the central tendency. In addition to knowing the center point of the distribution, it is also important to get an idea of the spread. The range is a quick estimate of the spread, but a more reliable one, which uses all the data points in the distribution, is the sample variance. We also have to use the appropriate degrees of freedom correctly in the different formulae we consider in our analysis. We looked at the box plot, the scatter plot and the histogram as methods of representing data. The overall summary and comparison between two sets of data is given nicely by the box plot. When you have two sets of data and you want to see whether there is any relation between them, you can use the scatter plot. When you have a large data set and want to present the overall trend rather than the individual details, you go for the histogram; from it you can see whether the data is symmetric, single-peaked or multiple-peaked, and whether it follows the normal distribution. Finally, we looked at a more concrete way, beyond visual inspection, to identify which distribution the given data better relates to, and we found that even without using special probability papers we can check the distribution. What we have to do is organize the data into ranked data.
We put the data in ascending order, assign ranks, and then use an appropriate percentile-creating formula; the more popular and convenient one is (i - 0.5)/n, where i is the rank and n is the total number of data points. We then identify the z value whose cumulative probability equals (i - 0.5)/n and plot the z values against the ranked x values; if we get a straight line, we say the data is normally distributed. We will continue our discussions in the next lecture, where we will also work out a few example problems to drive home the concepts we have covered so far. Thank you.