In today's lecture we will be summarizing the key ideas from the Statistics for Experimentalists course. The books referenced here are very useful and quite reader friendly. They also have the advantage of listing a large number of problems which may be easily worked out either with a calculator or with statistical design software. The first book would be a good reference textbook, a prescribed textbook rather, for a semester course on statistics for experimentalists. More details on design and analysis of experiments are given in the second book. The third book is a slightly more advanced one but again very user friendly; it covers a lot of advanced techniques, stresses second order designs and various optimal designs, and also focuses on response surface methodology. So it is a good bridge between the fundamentals learned in this course and advanced studies for those who are interested. The last book, Random Phenomena, also looks at things in a nice manner and gives some interesting explanations of very difficult topics. It in fact puts things in perspective and helps you understand some difficult concepts.

The key features in experimental work start with data representation. We looked at box plots, scatter plots, normal probability plots and histograms. Each has certain applications, and it is important that we use these diagrams or figures as frequently as possible to support our experimental findings. Whenever we do experiments we do repeats, and the important thing to note is that the average of the experimental results alone is not sufficient. You also have to quantify the scatter in the experimental data and see how large it is: will it be smaller than the variation caused by your factors, or will it be comparable to the variation caused by the factors in your experiment? Only then can you make the correct conclusion. You can do experiments in a planned manner and unambiguously identify the important variables that affect the response of the experiment, and after performing the experiments the next question is where to do further experiments and how to proceed along the correct path so that eventually the optimum conditions may be identified.

It is important to stress the variability. The scatter in experimental data is unavoidable; it is nothing but a law of nature. It is possible to handle scattered experimental data and still draw meaningful conclusions with statistical tools, and as I said earlier, in addition to averaging the experimental data, importance must be attached to the variability in the data. So the variability in the experimental outcome is modeled as a linear combination of the mean value of the response and a random error component, y_i = mu + epsilon_i. The random error component epsilon_i is responsible for the observed variations when repeating the same experiment.

With this background we embarked upon an exciting journey with random variables, quantities with which probabilities are associated. We talked about probability mass functions for discrete random variables, and for continuous random variables we talked about probability density functions. We denoted the random variable as capital X, and once its value is known after an experiment or a sample survey it is represented as small x. The random variable is the originator of probability distributions, and probability distributions are used to predict the behavior of a group of randomly behaving entities.
You cannot predict a single randomly behaving person's behavior, but when you have a collection of such people their actions may be better quantified. So we talked about probability distributions; the probability density function describes the distribution of probabilities over the continuous random variable domain. For discrete random variables we talked about probability mass functions: you assign a weight or a probability to every discrete outcome in the sample space, which contains the set of possible outcomes from a particular experiment, and when you add up all the probabilities they should total 1. But discrete random variables are not that frequently encountered. The more commonly encountered ones are continuous random variables, and we describe their behavior through appropriate probability density functions; we calculate the probability of a random variable taking a value between a and b, for instance, using the appropriate distribution.

So we define the cumulative distribution function F(z) for a continuous random variable X as the probability of X taking a value less than or equal to z: we integrate the probability density function f(x) from minus infinity to z, and the result is the cumulative distribution function of the random variable X, represented as F(z). If you want to find the probability of the random variable lying between a and b, it is the integral from a to b of f(x) dx, which is the cumulative distribution at b minus the cumulative distribution at a.

Then we described some important features of a continuous probability distribution. We talked about the mean and variance of the distribution: mu is equal to the expected value of X, which is the integral from minus infinity to plus infinity of x f(x) dx. The variance of the continuous probability distribution is given by sigma squared equal to the expected value of (X minus mu) whole squared, which is the integral from minus infinity to plus infinity of (x minus mu) whole squared times f(x) dx.

Then we talked at length about the normal probability density function. This is very commonly encountered in statistical applications: we are going to discuss the central limit theorem, and when you look at various distributions like the t distribution or the chi-square distribution, they tend to the normal distribution under certain special conditions, so we frequently resort to the normal probability density function or Gaussian distribution. One advantage of the normal distribution is that the mean and variance, mu and sigma squared, are themselves the parameters of the distribution. The probability density function for the normal distribution is given by f(x) equal to 1 by root of (2 pi sigma squared) times e to the power of minus (x minus mu) whole squared by (2 sigma squared), where x ranges from minus infinity to plus infinity. We also talked about the log normal distribution. As far as the normal distribution is concerned, we use the notation N(mu, sigma squared) to represent it.
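As a quick numerical illustration of obtaining a probability from the normal distribution, here is a minimal Python sketch, assuming the scipy library is available; the values of mu, sigma, a and b are made up:

```python
from scipy.stats import norm

# Hypothetical example: X ~ N(mu = 50, sigma^2 = 4), so sigma = 2.
mu, sigma = 50.0, 2.0
a, b = 48.0, 53.0

# P(a <= X <= b) = F(b) - F(a), the difference of cumulative distribution values.
prob = norm.cdf(b, loc=mu, scale=sigma) - norm.cdf(a, loc=mu, scale=sigma)
print(f"P({a} <= X <= {b}) = {prob:.4f}")
```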
There can be different normal distributions for different means and variances, different mu and different sigma squared. It would be good if we could standardize or normalize them so that one probability distribution graph or table is sufficient to calculate the required probabilities; otherwise, for each mu and sigma squared we may require a different probability table. One simple way to do it is to convert the random variable X, which follows a normal distribution, into a standard normal variable, that is, a normal variable with mean 0 and variance 1. How do we do it? We take the usual normal variable X, subtract the mean mu from it and divide by sigma. Once we do that, we call the transformed variable z equal to (x minus mu) by sigma. This follows the standard normal distribution, which means it has a mean of 0 and a variance of 1. The cumulative distribution of a standard normal random variable is denoted as phi of z, equal to the probability of capital Z being less than or equal to z.

Then we talked at length about populations. Populations have to be understood from a decision making, quality control or marketing point of view, and once we understand the population we can set our goals, objectives, process settings and so on. We cannot collect data from the entire population, so we have to make use of sampling. We talk about random samples, which hopefully will give us sufficient information about the population parameters. From the sample attributes we need to obtain estimates of the population mean, variance, nature of the distribution and so on. The sample should comprise independent observations coming from the same population; in other words, they should represent the same probability distribution. The sampled elements must be independent of each other, each element should have an equal probability of being picked, and the size of the sample is very important: the higher the sample size, the more confident we feel about the precision of the estimated parameters.

Once you have a random sample, you can use the collected data to find the sample mean and sample variance. The sample mean and sample variance are themselves functions of random variables and hence are random variables themselves. Such functions are also called statistics; hence the sample mean x bar and the sample variance s squared are called statistics. Now how do we define the sample mean? We found that x bar is given by the sum from i = 1 to n of xi, divided by n, and s squared is given by the sum from i = 1 to n of (xi minus x bar) whole squared, divided by (n minus 1). Here n is the size of the sample. As far as the sample variance goes, we divide by n minus 1 because not all the deviations from the mean are independent of each other. We have only n minus 1 deviations from the mean which are independent, since the sum from i = 1 to n of (xi minus x bar) is equal to 0. So this n minus 1 may also be termed the degrees of freedom, as we saw in some of the variance calculations in the lectures.

Then we looked at point estimators. Point estimators mean that we are going to get single values for the population parameters, and we looked at the sample mean x bar and the sample variance s squared. One important advantage of these point estimators is that they are unbiased estimators: the expected value of x bar is equal to mu, and it can be shown that the expected value of s squared is equal to sigma squared, the population variance itself.
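As a small numerical illustration of these sample statistics and of the standardization z = (x minus mu) by sigma, here is a minimal Python sketch, assuming numpy and scipy are available; the data values and the population parameters mu and sigma are hypothetical:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical sample of n = 8 repeated measurements.
x = np.array([9.8, 10.1, 10.4, 9.9, 10.2, 10.0, 9.7, 10.3])

x_bar = x.mean()            # sample mean: sum of xi divided by n
s2 = x.var(ddof=1)          # sample variance: sum of (xi - x_bar)^2 divided by n - 1
print("x_bar =", x_bar, " s^2 =", s2)

# Standardizing: z = (x - mu) / sigma maps N(mu, sigma^2) onto N(0, 1),
# so one standard normal table (or function) serves every mu and sigma.
mu, sigma = 10.0, 0.25      # assumed population parameters
z = (10.4 - mu) / sigma
print("P(X <= 10.4) =", norm.cdf(z))   # same as norm.cdf(10.4, loc=mu, scale=sigma)
```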
So from the sample mean x bar we get a point estimate of the population mean mu, and from the sample variance s squared we get an idea, a point estimate, of sigma squared, the population variance. So x bar and s squared are our point estimators, and their values are the point estimates, of the population parameters mu and sigma squared respectively. Since statistics are also random variables, x bar and s squared each have a probability distribution associated with them, and these are referred to as the sampling distributions of the sample mean and the sample variance respectively. Any random variable has a probability distribution associated with it, and so do x bar and s squared.

What is really the meaning of the probability distribution associated with the two sample statistics? If we take multiple random samples, each sample may give a different sample mean and a different sample variance. The distribution of the sample means and of the sample variances constitutes the sampling distribution of the mean and the sampling distribution of the variance. So each of these statistics has a probability distribution associated with it.

So what are the properties of the sampling distribution? We will look at a general case involving n independent random variables; however, we will assume that all of them come from populations that have the same mean mu and variance sigma squared. The condition of independence of the random variables is important. When we look at the expected value of x bar, it is the expected value of x1 plus the expected value of x2 and so on up to the expected value of xn, all divided by n. So the expected value of x bar is n mu by n, because each of the random variables comes from a population with the same mean mu and variance sigma squared: the expected value of x1 is mu, the expected value of x2 is mu, and so on up to the expected value of xn, which is also mu. So the expected value of x bar is equal to mu itself. On similar lines, we can also show that the expected value of s squared is equal to sigma squared.

So what is the interpretation of the variance of x bar, the variance of the sample mean? Many random samples can be drawn from a population and each of them may have a different mean; this is understandable. So there will be a distribution of the sample means. What it means is, if you plot the different sample means you have obtained in the form of a frequency diagram, you will find that certain sample mean values are more popular than the rest, so they will occur more frequently. This is typical of any probability distribution, because around the mean value you get higher values of the probability density function. Anyway, you will have a range of sample mean values described by a probability distribution, and the spread of this sampling distribution of the means is characterized by the variance of x bar. So if the variance of the random variable X is sigma squared, what is the variance of x bar? It is sigma squared by n. So if sigma squared is the variance of the population, the variance of the sampling distribution of x bar is sigma squared by n. The population's probability distribution function will have variance sigma squared, whereas the sampling distribution will have variance sigma squared by n.
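Before seeing how this result is derived, it can be checked quickly by simulation. A minimal Python sketch, assuming numpy is available and with mu, sigma and the sample size n chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 5.0, 2.0, 25        # assumed population mean, standard deviation, sample size

# Draw 10,000 random samples of size n and record each sample mean.
sample_means = rng.normal(mu, sigma, size=(10_000, n)).mean(axis=1)

print("mean of x_bar     :", sample_means.mean())   # close to mu
print("variance of x_bar :", sample_means.var())    # close to sigma^2 / n
print("sigma^2 / n       :", sigma**2 / n)
```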
And how did we get this result? The variance of x bar is the variance of x1 by n plus the variance of x2 by n and so on up to the variance of xn by n, and the variance of x1 divided by a constant is the variance of x1 divided by the square of the constant, that is, sigma squared by n squared. Since all of them come from populations with the same mean and the same variance sigma squared, the variance of x1 is sigma squared, the variance of x2 is also sigma squared, and so on up to the variance of xn, which is also sigma squared. So you have n sigma squared by n squared, and the variance of x bar is sigma squared by n.

So you have a probability distribution of the sample means, which is nothing but a distribution of the probabilities of x bar. This distribution has a mean that is the same as the population mean mu. So if the population is centered around the mean mu, the probability distribution of the sample mean is also centered at mu; both the sampling distribution and the population distribution have the same mean mu. The variance of the parent population's probability distribution is sigma squared, but on the other hand the sampling distribution's variance is sigma squared by n. So, compared to the population, the sampling distribution of the means has the same mean mu but a lower variance, given by sigma squared by n, where n is greater than 1. The sampling distribution of the mean has a variance which is one nth of the variance of the population's probability density function.

Then we looked at confidence intervals: just as we had point estimates of the population parameters, we also have interval estimates. We looked at putting upper and lower bounds on the population parameters. So we defined a 95% confidence interval, and that is given by x bar minus z alpha by 2 times sigma by root n, less than or equal to mu, less than or equal to x bar plus z alpha by 2 times sigma by root n. Here it is assumed that the parent population's variance sigma squared, or its standard deviation sigma, is known to us. Alpha is the level of significance. Since we are looking at a lower bound and an upper bound on mu, we use z alpha by 2. What is z alpha by 2? It is the upper 100 alpha by 2 percentage point of the standard normal distribution.

So now let us look at sampling distributions and the central limit theorem. The central limit theorem is a very useful concept, a helping hand in statistics. The population can be described by any arbitrary probability density function; even though most populations tend towards normality, there may be certain populations which are described differently. Suppose we take a sample from such a population. If the sample size is greater than 30, then the sampling distribution of the mean tends towards normality. This is very good. Irrespective of the sample size, if the parent population is normally distributed, then the means of random samples taken from such a population are also normally distributed. If the parent population is not normal but we take samples of size greater than 30, then the probability distribution of the sample means will be normal, or will tend towards normality. So this is a very useful tool in statistical analysis. So what would be the variance of such a distribution?
If the sample size is greater than 30, the sampling distribution of the mean is distributed according to the normal distribution with mean mu and variance sigma squared by n, where mu and sigma squared are the mean and variance of the parent population. Often we do not know the value of mu, and on top of that we do not know sigma squared either; this problem is mitigated if we take a large sample. When we do not know the value of sigma and we have taken a large sample, we know that the sample mean is going to behave normally according to the central limit theorem, and we can define a standard normal variable according to z equal to (x bar minus mu) divided by (s by root n): we knock off sigma, which we do not know, put the sample standard deviation s in its place, and divide by root n. In such cases the large sample confidence interval for mu is defined by the probability of x bar minus z alpha by 2 times s by root n, less than or equal to mu, less than or equal to x bar plus z alpha by 2 times s by root n, being equal to 1 minus alpha. From such a definition we get a 100 (1 minus alpha) percent confidence interval on mu as x bar minus z alpha by 2 times s by root n, less than or equal to mu, less than or equal to x bar plus z alpha by 2 times s by root n. When sigma squared is not known and we are using s squared instead, then for the central limit theorem to hold we recommend a sample size greater than 40; this is to account for the additional variability due to the unknown sigma.

Okay, now suppose the population is normal and the variance is known. Then even if the samples taken from such a normal population of known variance sigma squared are small ones, the sampling distribution of the means will still be normal. We are taking small samples out of such a population and, miraculously, the sigma squared of that population is known; then the sampling distribution of the means is normal with mean mu and variance sigma squared by n. There is no problem with that. What if the parent population is normal but the variance sigma squared is not known and the sample size is small? Then of course, without knowing sigma squared, we have to use s squared, but we cannot invoke the central limit theorem and say that the resulting distribution is normal, because sigma squared is not known and the sample size is very small. In such cases we have to use a special distribution called the t distribution. The t distribution depends upon the sample size, and n minus 1 is referred to as the degrees of freedom of the t distribution. The t distribution tends towards the normal distribution as the sample size increases towards infinity. We describe the t statistic as t equal to (x bar minus mu) divided by (s by root n). This t distribution is very useful: it is helpful in hypothesis testing, linear regression analysis and design of experiments.

Next, we look at the chi-square distribution. As such, the chi-square distribution is not very frequently encountered on its own, but the ratio of two chi-square random variables leads to another distribution called the F distribution, and from that point of view we need to understand what is meant by the chi-square distribution. The chi-square distribution is also used whenever we need confidence intervals on population variances. We assume that the parent population is normally distributed, and let us now define the chi-square distribution.
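To make the two interval estimates concrete, here is a minimal Python sketch, assuming numpy and scipy are available and using hypothetical data; with only 7 observations the t interval is the appropriate one, and the z interval is shown only to illustrate the formula for the large-sample case:

```python
import numpy as np
from scipy.stats import norm, t

x = np.array([4.8, 5.2, 5.0, 4.7, 5.3, 4.9, 5.1])   # hypothetical sample
alpha = 0.05
n = len(x)
x_bar, s = x.mean(), x.std(ddof=1)

# Large-sample interval: x_bar +/- z_{alpha/2} * s / sqrt(n).
z_half = norm.ppf(1 - alpha / 2)                 # upper alpha/2 point, about 1.96
print("z interval:", x_bar - z_half * s / np.sqrt(n), x_bar + z_half * s / np.sqrt(n))

# Small-sample interval with unknown sigma: replace z_{alpha/2} by t_{alpha/2, n-1}.
t_half = t.ppf(1 - alpha / 2, df=n - 1)
print("t interval:", x_bar - t_half * s / np.sqrt(n), x_bar + t_half * s / np.sqrt(n))
```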
Let us say that you have a sample x1, x2, and so on up to xn, which is a random sample from a normal distribution with mean mu and variance sigma squared, and s squared is the variance of the sample. The following random variable defines the chi-square distribution with n minus 1 degrees of freedom: chi squared equal to (n minus 1) times s squared by sigma squared. This is also a function of random variables and hence it also has a probability distribution, called the chi-square distribution. Whenever we want to find probabilities using the chi-square distribution, we say that the probability of chi squared greater than chi squared alpha, k is equal to alpha, where the area of the probability density function beyond chi squared alpha, k is alpha. Formally, chi squared alpha, k is the upper 100 alpha percentage point of the chi-square distribution with k degrees of freedom.

After defining the chi-square distribution, it is useful to talk about hypothesis testing. In hypothesis testing, what do you test for? We test queries regarding the parameters of the population, that is, about mu or about sigma squared. In hypothesis testing we use the sample attributes, such as the sample mean and sample variance, but we postulate or pass judgment on the population parameters mu and sigma squared. This is to be noted: we use the samples taken from the population, and the sample attributes obtained from them such as x bar and s squared, in our analysis, but we pass judgment on mu and sigma squared.

There are two hypotheses here: one is the null hypothesis and the other is the alternate hypothesis. In defining the two hypotheses, we imply that rejection of the null hypothesis means automatic acceptance of its alternate. The null hypothesis is usually a statement representing the status quo: the new process you are suggesting is not going to produce an improvement, or, in the regression equation we are talking about, a given regression parameter is not influential in modeling the response. So the status quo is maintained in the null hypothesis. In the alternate hypothesis we try to negate the statement made in the null hypothesis: either we say it is different from a certain postulated value, or we say it is greater than or less than it. Each of these calls for a particular kind of test. If the alternate hypothesis says not equal to, or different from, then we have to conduct what is called a two-tailed test. If the alternate hypothesis has a statement such as greater than or less than, then we have to do what is called a one-tailed test. So what we do is use the sample data to compute a test statistic based on the sample measurements, and we use this to establish whether the null hypothesis or its alternate is supported.

So what are the different types of errors we may encounter in decision making? One decision may be to accept H0, or rather not reject H0. When H0 is true, this is a correct decision; when H0 is false, it is called a type 2 error. That means wrongly accepting the null hypothesis when it is false is called a type 2 error. Then we have what is called a type 1 error, which is more serious. A type 1 error occurs when H0 is actually true but you reject the null hypothesis. It is considered to be quite a serious error.
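Since the chi-square distribution is what gives confidence intervals on the population variance, here is a minimal Python sketch of such an interval, assuming scipy is available and using hypothetical data:

```python
import numpy as np
from scipy.stats import chi2

x = np.array([12.3, 11.8, 12.9, 12.1, 11.6, 12.7, 12.4])   # hypothetical sample
alpha = 0.05
n = len(x)
s2 = x.var(ddof=1)

# (n - 1) s^2 / sigma^2 follows a chi-square distribution with n - 1 degrees of freedom,
# which yields a confidence interval on the population variance sigma^2.
lower = (n - 1) * s2 / chi2.ppf(1 - alpha / 2, df=n - 1)   # divide by the upper alpha/2 point
upper = (n - 1) * s2 / chi2.ppf(alpha / 2, df=n - 1)       # divide by the lower alpha/2 point
print(f"95% CI for sigma^2: [{lower:.3f}, {upper:.3f}]")
```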
It is like a court case where the null hypothesis is that the defendant is innocent. That is the null hypothesis, and if, based on the arguments put forward by the prosecution, the judge concludes that the defendant is guilty even though he is innocent, then he is committing a serious error: he is actually sentencing an innocent defendant. So wrongly rejecting the null hypothesis is called the type 1 error. And when you reject H0 when H0 is false, it is a correct decision.

Now we come to the F distribution. The F distribution is used in hypothesis testing, linear regression and so on. In F tests we compare ratios of variances in order to infer whether they are comparable to one another or much different from one another. So to compare two variances we require the F distribution concept. We assume that the two populations from which the samples were taken for comparison are both normally distributed, and that the population means mu1 and mu2 and standard deviations sigma1 and sigma2 are not known. The F random variable, or the Fisher random variable, is defined as the ratio of two independent chi-square random variables, CD1 and CD2, each scaled by its own degrees of freedom. So F is the ratio of (CD1 by m1) divided by (CD2 by m2), where CD1 represents the first chi-square random variable with m1 degrees of freedom and CD2 the second chi-square random variable with m2 degrees of freedom. From the definition of the chi-square distribution we can show that F is equal to (s1 squared by sigma1 squared) divided by (s2 squared by sigma2 squared). When the two samples are of sizes m and n, this F distribution has m minus 1 numerator degrees of freedom and n minus 1 denominator degrees of freedom; m minus 1 and n minus 1 are the degrees of freedom of the chi-square random variables in the numerator and in the denominator. The percentage point of the F distribution is defined such that the probability of F greater than F alpha, m1, m2 is the integral from F alpha, m1, m2 to infinity of f(x) dx, which is equal to alpha. And if you have F alpha, m2, m1, then by taking its reciprocal you can find F of 1 minus alpha, m1, m2. This is a useful result.

Now we slowly move into the design of experiments, after having built a statistical background. The statistical background which was presented is adequate for understanding the statistical design concepts and is necessary for further understanding of experimental design; without knowing these things it is not a good idea to venture straight into design of experiments. We need to know what is meant by a random variable, what is meant by the normal probability distribution and how to compute probabilities in such distributions, what is meant by the central limit theorem, what is meant by a point estimator, what is meant by an interval estimate, and what is meant by a 95% confidence interval. All these things are very essential. We also need to know about the chi-square and F distributions. Even though there are a lot of other distributions, like the beta distribution, gamma distribution, Weibull distribution and so on, we do not have to learn all of them. If you learn the normal distribution, t distribution, chi-square distribution and F distribution, it is sufficient. The log normal distribution also occasionally crops up, so it is a good idea to know about it. Beyond that we do not need to look into all the statistical distributions. And once you have understood a statistical distribution, you should also learn how to find the mean and variance of that distribution.
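To see the percentage point of the F distribution and this reciprocal relation numerically, here is a minimal Python sketch, assuming scipy is available; the degrees of freedom m1 and m2 are arbitrary:

```python
from scipy.stats import f

alpha, m1, m2 = 0.05, 4, 9      # assumed numerator and denominator degrees of freedom

# Upper 100*alpha percentage point: P(F > f_upper) = alpha.
f_upper = f.ppf(1 - alpha, m1, m2)

# Reciprocal relation: F_{1-alpha, m1, m2} = 1 / F_{alpha, m2, m1}.
f_lower = f.ppf(alpha, m1, m2)
print(f_upper, f_lower, 1.0 / f.ppf(1 - alpha, m2, m1))    # the last two values agree
```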
Because they are very important parameters. The samples which are drawn from such populations also have distributions, and these are very interesting. When you look at the t distribution, you have to note that there is a degrees of freedom as an additional parameter. Even for the chi-square distribution, you have degrees of freedom as an additional parameter. For the F distribution, you have numerator degrees of freedom and denominator degrees of freedom. So when you have such distributions, you cannot have one single probability chart where you look up the appropriate statistic value and read off the probability; the tables may require extrapolation or interpolation, and that is not going to be very accurate. Fortunately, statistical software and even spreadsheets have powerful statistical functions, and you can find both the probability and the inverse of the probability using these functions. For example, given the F value you can find the probability, or given the probability you can find the F value by taking the inverse. So all these things are possible, and I request you to be familiar with a spreadsheet or statistical software where you can calculate all these probabilities without any ambiguity.

Now let us move on to experiments involving one factor. Here we are going to assume that the experimental response is influenced by changing only one variable; all other variables or factors are kept at constant values. Here we look at the mean square of treatments and the mean square error. We look at the variability caused by changing the level of that particular factor, and we also do repeats and look at the variability in the repeated runs. We compare the variability across treatment levels with the variability due to repeats and find the ratio. If the variability across treatment levels is much higher than the variability due to repeats, then we can claim that, despite the experimental error, the factor is influencing the outcome of the experiment. These are very interesting concepts, and since we have only one factor, the mathematical analysis is not difficult: you can easily do the calculations even with a hand calculator, and doing them with a spreadsheet is even easier.

So we construct the analysis of variance or ANOVA table, where we list out the sources of variation, in the form of treatments and error, along with the total variation, and then the sums of squares: the sum of squares of treatments, the sum of squares of error and the total sum of squares. The degrees of freedom for the treatments are a minus 1, where a is the number of levels of the factor settings. For example, if I am looking at the effect of temperature on the reaction yield and I have 4 different temperatures, then the number of treatment levels is 4 and the degrees of freedom would be 4 minus 1, or in general a minus 1, where a is the number of independent factor levels or settings. Then you also calculate the sum of squares due to error. Let us say that at each factor setting you repeat the experiment n times; then the degrees of freedom associated with the error are a times (n minus 1). Next you calculate the mean square. The mean square of treatments is nothing but the ratio of the sum of squares of treatments to the degrees of freedom for the treatments. So we get the mean square treatments.
Then we also have the mean square error, which is nothing but the ratio of the sum of squares of the error to the degrees of freedom for the error. So we have the mean square error, and you calculate the F value as the ratio of the mean square treatments to the mean square error. Then we see whether this F statistic falls in the critical region. You have a level of significance alpha, and you find out the F alpha corresponding to the treatment degrees of freedom and the error degrees of freedom; the treatment degrees of freedom appear in the numerator and the error degrees of freedom are termed the denominator degrees of freedom. When you find the value of F alpha with the numerator and denominator degrees of freedom, you have the critical value. You see whether the ratio exceeds the critical value; if it does, then the statistic lies in the rejection region and you can reject the null hypothesis. If however the F statistic is lower than F alpha with those numerator and denominator degrees of freedom, then you have to accept, or rather fail to reject, the null hypothesis, which states that the treatment does not have an effect on the response and all the variation is caused only by random fluctuations. So it is important for you to state the hypotheses clearly and unambiguously, then carry out the F test and make the correct conclusion (a small worked sketch of these calculations is given at the end of this summary).

One important thing to note here: whenever you do experiments, please do as many repeats as possible so that you have an idea about the experimental error. It is also important for you to randomize your experiments so that the runs are not carried out in a standard sequence but in an arbitrary or random order. This is to ensure that any discrepancies present in the experimental data, over and above the variation caused by the main factors, are due only to random factors and not to any systematic factors. So any deviations or any scatter in the data may be attributed only to random errors and not to systematic discrepancies. When we do randomization, the effects of the unaccounted factors are randomly and uniformly dispersed among the different experiments. Randomization is implemented by running the designed experiments in a random order, and the allocation of the experimental material or resources to the different runs is also done in a random fashion. You jumble up the sequence of your runs so that there is no specific pattern. For example, if your experiment involves running the machine at a higher speed and it happens that that particular day is very, very hot, then you have the combined effect of high speed and high ambient temperature, which may have a particular effect on the response, and you may get a higher than anticipated response. But if you do the experiments in such a way that medium and low machine speeds are also run on hot days, then the effect is dispersed among the various settings of your factors. So this randomization is very important, so that any interference from the external world is distributed over all the runs and not only over a specific set of runs, and hence does not seriously affect our process. An additional issue here is blocking. This we will continue in the next lecture.
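As referenced above, here is a minimal Python sketch of the single-factor ANOVA calculations, assuming numpy and scipy are available; the response values, the number of levels a and the number of repeats n are made up purely for illustration:

```python
import numpy as np
from scipy.stats import f

# Hypothetical single-factor experiment: a = 4 temperature levels, n = 3 repeats each.
y = np.array([
    [68.2, 69.1, 67.8],   # level 1
    [71.5, 72.0, 70.9],   # level 2
    [74.3, 73.8, 74.9],   # level 3
    [73.1, 72.6, 73.9],   # level 4
])
a, n = y.shape
alpha = 0.05

grand_mean = y.mean()
ss_treat = n * ((y.mean(axis=1) - grand_mean) ** 2).sum()    # between-level sum of squares
ss_error = ((y - y.mean(axis=1, keepdims=True)) ** 2).sum()  # within-level (repeat) sum of squares

ms_treat = ss_treat / (a - 1)          # mean square treatments
ms_error = ss_error / (a * (n - 1))    # mean square error
f_stat = ms_treat / ms_error

f_crit = f.ppf(1 - alpha, a - 1, a * (n - 1))   # critical value F_{alpha, a-1, a(n-1)}
p_value = f.sf(f_stat, a - 1, a * (n - 1))      # P(F > f_stat)
print(f"F = {f_stat:.2f}, critical value = {f_crit:.2f}, p = {p_value:.4f}")
# Reject the null hypothesis of no treatment effect when f_stat exceeds f_crit.
```

The same F statistic can be cross-checked with scipy.stats.f_oneway by passing the observations at each level as separate arguments, and in a real experiment the run order itself would be randomized, for example with numpy.random.permutation.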