Welcome to the Dealing with Materials Data course. In this session we will consider sampling distributions. Let us first review what we have done in the past. We introduced certain special random variables: discrete random variables, whose distributions we call discrete distribution functions, and continuous random variables, whose distributions we call continuous distribution functions. For each of them we introduced its mean and its standard deviation. We also introduced the central limit theorem. And in the previous session we looked into how to check the correctness of an assumed distribution graphically through probability plots. What we want to do in this session, and in all the sessions from now onwards, is to make a relationship between the observed sample and the population. Let us try to understand why we are doing all this. The whole purpose of learning statistics is this: we are all aware that there is a general population, some larger body of information in the world, an entirety of a certain kind of data. But we cannot look into each and every data point that is available in the population. This entirety I call the population. What we do instead is take a small sample out of it, observe it, do certain calculations on it, and try to make a judgment about what the population would be like. This is the whole game of statistics. So from now onwards in this course, we are trying to set up our understanding of the population through a random sample. We are going to introduce what is called a random sample once again; we have already done it once in the past, and we are going to do it again. We will introduce something called a sample statistic. Then we will talk about the most common way to understand any population, or any sample for that matter.
As we studied in descriptive statistics, the most common numerical summaries are the central tendency, which here we call the sample mean, and the dispersion of the data, through the sample variance. In this particular session we are going to concentrate only on the sample mean. We will find the expected value of the sample mean, because remember that the sample mean is also a random variable: it has an expected value and a distribution, and therefore we would like to find its expected value and its standard deviation. We will revisit the central limit theorem, because it tells us more about the nature of the distribution of the sample mean when the sample size is very large. We will also show, with an example, how a binomial distribution can be approximated by a normal distribution. Remember that in the past we approximated the binomial by the Poisson; so we will clarify when it is to be approximated by a normal distribution instead, and we will give some examples. So let us understand, as I said, the population and the random sample. Statistics is a science whose whole purpose is to understand a population through a small observed sample, small compared to the population itself. Here I have given an example. Imagine that on a production line in an industry a certain alloy is being produced, and the customer is requesting an alloy with a certain value of a strength property. Now it is the industry's responsibility, the marketing department's responsibility, to tell the customer that our alloy has a certain mean strength property. How will it do that? Well, it cannot start testing each and every ingot that is produced.
What it may do instead is take a lot of randomly chosen ingots, and from each ingot randomly choose a specimen, put it through a test, find its strength property, and then say that the production lot has this much strength. Here the complete production lot is what we call, in statistics, the population, and the small sample you take is called a random sample. Formally speaking, let capital X1, X2, up to Xn be independent random variables from a common distribution function f; this common distribution function f refers to the population distribution function. Then we say that X1, X2, ..., Xn constitutes a random sample, or just a sample, from a population with distribution f. Here the randomness indicates that there is no obvious order in which X1, X2, ..., Xn are selected. It should not happen that you always take the first ingot from a lot, or every 10th ingot from a lot; you must choose in a random fashion so that the sample becomes representative of the population. A sample statistic, or simply a statistic, is a quantity that is computed using only the data. You have come across such quantities a lot in this course: for example the average, the standard deviation, the maximum of the data, the minimum of the data. All these quantities are calculated without making any assumption about what the distribution of the population is; a quantity computed purely from the data is called a statistic. Please remember that in the singular it is 'statistic'. Now, as I said, our whole interest is in finding out f, the population distribution. There are two cases here. In the first, you know the form of the distribution f. As I said in the past, if you are dealing with the strength property of a material, there is a possibility that it follows a lognormal distribution; so your f has the form of a lognormal distribution.
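To make the idea of a statistic concrete, here is a minimal sketch in Python (the session itself uses R; the yield-strength values below are hypothetical, made up purely for illustration) computing a few sample statistics from the data alone, with no assumption about the population distribution:

```python
import statistics

# A hypothetical random sample of yield strengths (MPa) from a production lot.
sample = [512.1, 498.7, 505.3, 521.9, 509.4, 495.8, 517.2, 503.6]

# Sample statistics: quantities computed from the data alone,
# with no assumption about the form of the population distribution f.
mean = statistics.mean(sample)     # central tendency
stdev = statistics.stdev(sample)   # dispersion (sample standard deviation)
lo, hi = min(sample), max(sample)  # extremes of the data

print(mean, stdev, lo, hi)
```

Each printed quantity is a statistic in the sense defined above: it depends only on the observed data, not on any assumed distribution.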
Or suppose you already know that this whole alloy is designed to have a particular value of a strength property, say the yield strength, and what you are going to do is actually measure the yield strength and take the difference from the design value. If you take such differences, you are in effect calculating an error, and the error may follow a normal distribution with mean 0 and some variance sigma squared. It means that you are able to decide a priori, right from the beginning, what the form of the distribution f is. But you know the form of the distribution only up to a point: you do not know the parameter values of that distribution. If you have assumed a normal distribution for an error, you may say that the mean value is 0, or you may say that you do not know the mean value, so the two unknown parameters are mu and sigma squared. If you are looking at a two-parameter Weibull distribution, then you are concerned with the scale and shape parameters. If you think that, no, it may be a three-parameter Weibull distribution, then you have the form of the distribution, which is Weibull, but what you do not have are the three parameter values: the location, scale, and shape. Such a situation, where you know the form of the distribution but not its parameter values, falls in the first case, which is called parametric estimation and inference. Suppose you do not even know the form of the distribution, which happens, or you are not able to properly justify that it should fall in a particular form of distribution; then it is called non-parametric inference and estimation.
In the present course we are going to consider only the case of parametric estimation, and therefore the common distribution will be referred to as the population distribution: we know its form, but we do not know its parameters. The first and foremost things we would like to learn to estimate are, as in descriptive statistics, the mean, which plays a central role, and likewise the standard deviation or variance. So we will talk about two sample statistics: one is the sample mean, and the other is the sample variance, which we will take up in the next session. In this session we will consider the sample mean. So consider the case of a population with distribution function f, with mean value mu and variance sigma squared. Remember, we are not saying that this is a normal distribution or a Weibull distribution; we are just saying that it is some distribution which has mean mu and variance sigma squared. Mu and sigma squared are also called the population mean and population variance respectively, which you do not know; these are now your two unknown parameters. So we are looking at a very generalized case right now. Let X1, X2, ..., Xn be a random sample from this population with mean mu and variance sigma squared, and define the sample mean, x bar, as the average of the random sample. When you actually want to compute it, you replace each capital Xi by its realisation, small xi. But note: I take a sample and find an x bar; someone else takes a sample and finds an x bar; there are many possible values of x bar. We are treating the random sample as random variables arising from the same distribution f with mean mu and variance sigma squared, and therefore x bar, being a function of random variables, is also a random variable.
So then we can take the expectation of x bar. If you follow the algebra of expectations, E(x bar) = (1/n) Σ E(Xi) = (1/n) Σ mu, because every Xi comes from a common mean mu, and therefore E(x bar) = mu. It means that x bar has an expected value equal to the population mean mu, or in other words x bar is an estimate of mu; you can say that mu can be estimated using x bar. In a later session we will call this a point estimate. What is the variance of x bar? Let us calculate: Var(x bar) = (1/n squared) Σ Var(Xi) = n sigma squared divided by n squared, and therefore it is sigma squared by n. I hope you remember from our past discussions that if the variance of X is given, then Var(aX) = a squared Var(X); this is the relationship I have used here. The n squared comes from this identity, and therefore the variance is sigma squared by n, or the standard deviation is sigma divided by the square root of n. Let us try to understand this. It says that the expected value of x bar is mu; it does not depend on the sample size. It is always mu, the population mean value. But the variance of x bar is going to decrease as n increases. It means that the spread of the distribution becomes smaller and smaller, and therefore the x bar values come closer and closer to the actual population mean. This is shown in the next slide. Here I have plotted the standard normal probability density function, not the distribution function, the probability density function. The blue line is when n is equal to 1; the green line is when n is equal to 2.
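The two results just derived, E(x bar) = mu and Var(x bar) = sigma squared over n, can be checked by simulation. Below is a sketch in Python (the session itself uses R; the population values mu = 100 and sigma = 15 are arbitrary choices for illustration):

```python
import random
import statistics

random.seed(0)
mu, sigma = 100.0, 15.0  # arbitrary population mean and standard deviation

def var_of_sample_mean(n, trials=20000):
    """Empirical variance of x-bar over many repeated samples of size n."""
    means = [statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
             for _ in range(trials)]
    return statistics.pvariance(means)

for n in (1, 2, 5, 10):
    # Theory says Var(x-bar) = sigma**2 / n, shrinking as n grows.
    print(n, round(var_of_sample_mean(n), 2), sigma**2 / n)
```

As n goes from 1 to 10, the empirical variance falls from about sigma squared = 225 towards sigma squared over 10 = 22.5, matching the shrinking spread shown on the slide.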
You see that the spread decreases drastically when you come to n equal to 2. If you take n equal to 5, it decreases further, and if you go to n equal to 10, it decreases even further. It means that as n becomes larger and larger, the distribution becomes more and more concentrated around the mean value, the population mean; because we have taken the standard normal distribution, it has population mean mu, and the distribution of x bar is coming closer and closer to that. Now let us revisit the central limit theorem. What does the central limit theorem say? Let us recall: let X1, X2, ..., Xn be independently and identically distributed random variables, so it is a random sample, with a common mean mu and a finite common variance sigma squared. Then for large values of n, P((Σ Xi − n mu)/(sigma √n) ≤ t) is approximately P(Z ≤ t), where Z is a standard normal variate. What it says is that this quantity behaves, approximately, like Z, which is distributed as a normal random variable with mean 0 and standard deviation 1; that is, a standard normal variable. If you divide the numerator and denominator by n, you find that (x bar − mu)/(sigma/√n) also behaves like a standard normal variate. Now let us look at this quantity very carefully. As n tends to infinity, sigma over the square root of n becomes smaller and smaller, and therefore x bar comes closer and closer to mu. With reference to the present understanding of random sample and population, we can say that the sample mean comes closer and closer to the population mean. And from this we get the standardized variable, another term that I would like to introduce.
For any random variable, here x bar, if you subtract its expected value and divide by the square root of its variance, the result is called the normalization of the random variable x bar. Why is it called normalizing? Because it starts behaving like a standard normal variate. So in future, whenever you hear that values have been normalized, this is what it really means. Now let us ask: what is the approximate distribution of the sample mean? Again, X1, X2, ..., Xn is a random sample with a common mean and a common finite variance sigma squared. The central limit theorem tells us that the normalized value of x bar follows a standard normal distribution: it equals Z, where Z follows the standard normal. This is exactly what we talked about. The question again is how large n should be before you can consider it as tending to infinity. The rule of thumb is that it should be at least 30; please recall the discussion we had in the previous sessions on the central limit theorem. The same thing applies here also. Now let us see another approximation. Applying the central limit theorem, we would like to study the approximation of the binomial distribution when n tends to infinity, that is, when the number of independent Bernoulli trials is very large. Let us consider a case. You remember the earlier question: a production unit has a probability of producing a defective unit of 0.01, and we take a randomly chosen batch of 100 units, so n is equal to 100. What is the probability that it has at most 50 defective units? That is, the maximum number of defective units it can have is 50.
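Before working through the example, the claim that the normalized sample mean behaves like a standard normal variate can be illustrated numerically. The sketch below (in Python; the session itself uses R) draws repeated samples from a deliberately non-normal population, an exponential with mean 1 chosen purely for illustration, and normalizes each sample mean:

```python
import math
import random
import statistics

random.seed(1)
n, trials = 50, 20000    # n = 50 satisfies the "at least 30" rule of thumb
mu, sigma = 1.0, 1.0     # an exponential(1) population has mean 1 and std dev 1

# Normalized sample mean z = (x-bar - mu) / (sigma / sqrt(n)), repeated many times.
zs = [(statistics.fmean(random.expovariate(1.0) for _ in range(n)) - mu)
      / (sigma / math.sqrt(n))
      for _ in range(trials)]

# By the central limit theorem these z values should look standard normal:
print(statistics.fmean(zs), statistics.pstdev(zs))  # close to 0 and 1
```

Even though the exponential population is strongly skewed, the normalized sample means come out with mean near 0 and standard deviation near 1, as the theorem predicts.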
It does not have more than 50 defective units; that is what it says. You recall we did the same calculation exactly in order to show the binomial approximation by the Poisson distribution; now we are going to use the normal approximation. Let X1, X2, ..., X100 denote the 100 units, and say that Xi is 1 if the unit is defective and Xi is 0 if it is not. We define X as the summation of Xi, i equal to 1 to 100, and we want the probability that this sum X is at most 50. Each Xi is a Bernoulli trial with probability of a defective p = 0.01. If you apply the central limit theorem, it says that (Σ Xi − n mu)/(sigma √n) behaves like a standard normal variate Z. Here mu for a Bernoulli trial is p, which is 0.01, and sigma squared for a Bernoulli trial is p(1 − p), which comes to 0.0099; you can calculate it out (the slide shows 0.564, which is a slip). And therefore P(X ≤ 50) equals the probability that (X − np)/√(np(1 − p)) is at most (50 − np)/√(np(1 − p)), which you can evaluate as a standard normal probability. The slide shows this as Z ≤ −1.33 with probability 0.092; as we will see in a moment, that figure contains an arithmetic mistake. The probability you can calculate either using normal tables, which are available and which I will discuss on the next slide, or using R, or even an Excel spreadsheet, where there is also a function to find this value. But in all cases, please remember to look at the help. So let us look here: I have the standard normal distribution plotted, this red line, with a cutoff value marked on the axis; Z is to be less than or equal to that, and we are looking for the corresponding probability. The question is: how much is it?
The normal tables that are generally available give you, for a value t, the probability P(Z ≤ t); that is, they tabulate the integral from minus infinity to t of (1/√(2π)) e^(−x²/2) dx, the area under the standard normal density curve. This is the value which is generally tabulated. But remember that the normal distribution is a symmetric bell-shaped curve. So if you stand exactly at the value 0, the area to the left, as you already know, is 0.5, and so is the area to the right. Some tables give values of t between minus infinity and plus infinity; others give only t greater than 0, tabulating the upper-tail area. How do you tackle a negative t, say −1.33? Well, by symmetry, the area to the left of −1.33 and the area to the right of +1.33 are exactly the same: if I call the first probability p1 and the second p2, then p1 = p2. Therefore, instead of finding the probability on the left side, you can find the corresponding probability on the right. What I am trying to say, which must have been said in your R session and which I just want to repeat, is that when you use a normal distribution table or a standard routine to find a normal probability, please check the help and make sure whether it gives the lower-tail or the upper-tail probability. For Z ≤ −1.33, this value turns out to be 0.092. Now let us go back to our example and put in the numbers: we want P(X ≤ 50).
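The lower-tail versus upper-tail point can be checked directly. Here is a small sketch in Python (the session itself uses R and printed normal tables; the standard normal CDF is written below via the error function):

```python
import math

def phi(t):
    """Lower-tail standard normal probability P(Z <= t), via the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

t = 1.33
p1 = phi(-t)       # area to the left of -1.33 (lower tail)
p2 = 1.0 - phi(t)  # area to the right of +1.33 (upper tail)
# By the symmetry of the bell curve the two tail areas agree:
print(p1, p2)      # both are about 0.092
```

This is exactly the symmetry argument above: a table that only lists positive t still lets you find P(Z ≤ −1.33), because it equals P(Z ≥ +1.33).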
So, x minus np divided by the square root of np(1 − p): the slide shows 50 minus 60 in the numerator, and, wait a minute, that is a mistake; let us correct it. With n = 100 and p = 0.01, np is 1, not 60, and np(1 − p) = 100 × 0.01 × 0.99 = 0.99. So the standardized value is (50 − 1)/√0.99, which is approximately 49.2, a large positive value; it is certainly not −1.33. If I call this value z0, then z0 lies far to the right of 0, and the probability we are looking for, P(Z ≤ z0), is essentially 1. That makes sense: with a defect probability of only 0.01 in a batch of 100, we expect about 1 defective unit, so the batch almost certainly has at most 50 defectives. Please check this working for yourself; the value shown on the slide is in error. So what have we done? We have approximated the binomial distribution by the normal distribution. Now there may be a confusion: earlier we approximated the binomial by the Poisson, and now we are approximating it by the normal. Which one to use? That is clarified here by stating that if n tends to infinity and p tends to 0 such that np remains constant, then the binomial can be approximated by the Poisson distribution.
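Using the values stated in the example, n = 100 and p = 0.01, the exact binomial probability can be compared with the normal approximation. A sketch in Python (the session itself uses R):

```python
import math

n, p = 100, 0.01  # batch size and defect probability from the example
k = 50            # at most 50 defective units

# Exact binomial probability P(X <= 50), summed term by term.
exact = sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# Normal approximation via the CLT: standardize k against np and sqrt(np(1-p)).
z = (k - n * p) / math.sqrt(n * p * (1 - p))
approx = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # P(Z <= z)

print(z, exact, approx)  # z is about 49.2; both probabilities are essentially 1
```

Both the exact calculation and the approximation confirm the corrected conclusion: a batch of 100 units with defect probability 0.01 is virtually certain to contain at most 50 defectives.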
But if only n is very large and you have no restriction on p, then the binomial can be approximated by a normal distribution under the central limit theorem. So with this, let us summarize what we learnt today. We learnt the concepts of a random sample and a sample statistic, with the common distribution function f, which is the population distribution function. The whole purpose of this session and the sessions coming up is to get information about the population through the random sample. If you know the form of the population distribution function f, the problem is called parametric estimation; if you do not know the form of f, it is called non-parametric estimation. In this course we are considering only parametric estimation. We showed that the sample mean estimates the population mean, and that the variance of the sample mean is the population variance divided by the sample size n. We showed that as the sample size n increases, the variance of the sample mean decreases. We also introduced the distribution of the sample mean using the central limit theorem, and we showed that, using the central limit theorem, the binomial distribution can be approximated by a normal distribution when only the sample size is large; you have to make no assumption about the probability of success p being small. Thank you.