Hello everyone, welcome again to the class Engineering Statistics. In the previous lectures, we talked about confidence intervals, and in particular we focused on how to construct confidence intervals using hypothesis tests; then we talked about various tests based on p values, the t test and the F test. In this lecture, we are going to continue our discussion of tests of hypotheses, but now we will focus on something called nonparametric estimation. That means methods which do not need to make any assumption about the distribution of the statistic we use to decide whether to accept or reject our null hypothesis. Then we will talk about various goodness of fit tests for this nonparametric setting. In particular, we are going to talk about three goodness of fit tests: first the chi-square test, then the Kolmogorov-Smirnov test, and then the Lilliefors test. Let us get started with what we mean by nonparametric estimation. The statistical methods we used so far assume knowledge of the population distribution, and moreover we assumed it is parameterized. For example, given a sample x_1, ..., x_n, we assumed the x_i's come from a certain distribution with a parameter theta, and then we set up our test as whether this theta equals a particular value theta_0. We did that using hypothesis testing. And in hypothesis testing, if you recall, we defined something called the power function, beta(theta), defined as the probability, computed under the parameter theta, that my sample belongs to the rejection region R, where R could come, for example, from a log likelihood ratio test.
But notice that to compute these probabilities, we explicitly needed to know this particular distribution: the probability was calculated under the parameter theta, and when we tried to give an alpha-level test, we bounded this probability over all theta belonging to the null hypothesis. I hope you all recall these discussions. So, in computing all of this, we explicitly needed to know the underlying distribution; the probability of the sample falling in the region R was calculated using that knowledge. We also looked into various statistics, for example the statistics used to do the t test and the F test. In the t test, if you recall, we had a statistic of the form (x_bar - mu) / (sigma / sqrt(n)), and we assumed it to be Gaussian distributed, which was the case when the samples themselves come from a Gaussian distribution; or, when sigma^2 is not known, we looked into the case where the statistic is Student-t distributed. So we enforced some distribution on the statistic itself which we used to make our decision. But now the question is: what if we do not want to enforce any distribution on the statistic beforehand? That is, what if my underlying distributions are not Gaussian; do I still need to make this assumption in order to apply these tests? Put alternatively, if I want to apply these tests, I am invariably assuming that the underlying samples come from a Gaussian distribution. But then, how do I check that the samples indeed come from a Gaussian distribution? For that itself we need a test. Thinking about all of this, we need a method where, to know the distribution of the statistic, we do not need to know the underlying distribution of the samples themselves.
So, keeping this in mind: what we have discussed so far, hypothesis testing, the t test, the F test, are called parametric methods, because they explicitly made use of the properties or the parameters of the underlying distribution. In the t test and the F test, in particular, they kind of assume that the normality assumption holds. But this is true only when we have a large number of samples, in which case a statistic of the form x_bar can be thought of as following a Gaussian distribution by the central limit theorem. However, this is not the case when we have a small number of samples, and that is why, before we apply any of the hypothesis tests, F test or t test, that we used before, we need to validate that the assumptions we are making on the underlying distribution of the samples are correct. To do that, we need a statistic whose distribution itself does not depend on the distribution of the samples. If that is the case, then we are going to call these nonparametric methods, or distribution-free methods. Now, suppose we want to check whether observed samples follow a certain distribution; then we need a test to check whether they follow the given hypothesized distribution, which is taken as the null hypothesis, and those tests we are going to call goodness of fit tests. Basically, we are asking whether the samples we observe follow the given distribution, and we want to check the goodness of that fit. In that regard, assume that the underlying population distribution is unknown, and we want to check whether the data follows a hypothesized distribution, which I am going to denote as F_0.
So, now the goal is the following. Earlier, given samples x_1, ..., x_n, I assumed the x_i's follow a certain parametric pdf or cdf. Now I am going to assume that this itself is not known; I want to check this itself, and I denote the hypothesized cdf as F_0. What is my hypothesis now? My test can now be posed as: the null hypothesis is that the cdf of my data equals F_0(x) for all possible values of x, and the alternative hypothesis is that it differs at at least one point; that is, the distribution of the data points is not the same as the null hypothesis distribution at at least one point. Now, this hypothesized distribution can either be completely specified with all its parameters: for example, F_0 could be associated with a Gaussian probability density function with parameters mu and sigma^2 fully given, or we could be told that our null hypothesis is a Poisson distribution with a given parameter lambda. Or it may happen that the hypothesized distribution is specified only in terms of its shape: for example, we only know that the null hypothesis is a Gaussian distribution, without knowing its parameters, or we may just be told it is a Poisson distribution without the parameter being specified. Now, how do we go about checking whether my samples follow this null hypothesis? For that we are going to see mainly two tests, one for discrete random variables and another for continuous random variables. For discrete random variables, we are going to use something called the chi-square test, proposed by the famous mathematician Karl Pearson in the early 1900s.
So, in this test what we are going to do is compare the observed frequencies with the expected frequencies under the null hypothesis, and as I said, this will be used mostly for discrete populations. Another test, introduced by the famous mathematicians and statisticians Kolmogorov and Smirnov, with a variant due to Lilliefors, compares the observed cumulative relative frequencies with those expected under the null hypothesis. So notice that in the chi-square test we compare the frequencies of the classes, which I will make a bit clearer shortly, whereas here we compare the cumulative relative frequencies of the distributions. The Kolmogorov-Smirnov and Lilliefors tests usually apply to continuous population densities; more specifically, Lilliefors is applied to check whether the underlying population is Gaussian distributed. Now, let us focus on the chi-square test. We want to test whether our observed data follows a given discrete population distribution, which I denote as F_0, and we are going to assume that it is completely specified. For example, since we are talking about the discrete case here, F_0 could be Poisson with a given lambda, so the parameter is specified, or it could be, say, binomial with parameters n and p, where both n and p are specified. Now, in this case, if I am given data, how am I going to check whether it follows this cdf? To do this, I am going to group my data into k classes. What are these k classes? For example, say my random variable X takes values 1, 2, up to k; then there are k classes. Or my random variable may just take the values head or tail.
So, in this case my k is going to be 2, these are the 2 classes; and similarly, if it is, say, a die, you obviously have 6 faces, in which case my k is going to be 6, and so on. Now, what I will be interested in is the expected frequency of class i. Instead of Poisson, let us take the binomial, which is easier to work with at this point. Say my X is binomially distributed, Binomial(m, p), writing m for the number of trials to avoid clashing with the sample size, and I have n samples. I know that the probability that X takes the value i is theta_i = (m choose i) p^i (1 - p)^(m - i). So this is the probability of observing the value i, and then the expected frequency of observing class i among my n samples is E_i = n theta_i. That is, if I have n samples, the expected number of samples falling in class i is n theta_i. That is what I mean by E_i here. Now, F_i is the observed frequency. What is this observed frequency? F_i is basically how many times you have observed the value i, that is, F_i equals the sum over j from 1 to n of the indicator that x_j equals i. So it is counting how many times your random variable has taken the value i out of your n samples. So this is the observed frequency of class i, and E_i is the expected frequency of class i under your null hypothesis. Notice that, because my null hypothesis is completely specified, I know these theta_i. Now comes the statistic. What we are going to do is check how far apart these F_i and E_i are.
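To make the counting concrete, here is a minimal sketch (not from the lecture; the sample and the binomial parameters m and p below are illustrative assumptions) that computes the class probabilities theta_i, the expected frequencies E_i = n theta_i, and the observed frequencies F_i:

```python
from math import comb

# Illustrative setup: fully specified null hypothesis Binomial(m, p).
m, p = 3, 0.5                      # H0: X ~ Binomial(3, 0.5), classes 0..3
sample = [0, 1, 1, 2, 1, 3, 2, 1]  # n = 8 hypothetical observations
n = len(sample)

# theta_i = P(X = i) under H0 (known, since H0 is completely specified)
theta = [comb(m, i) * p**i * (1 - p) ** (m - i) for i in range(m + 1)]

# expected frequency E_i = n * theta_i; observed frequency F_i = count of i
E = [n * t for t in theta]
F = [sum(1 for x in sample if x == i) for i in range(m + 1)]

print(theta)  # [0.125, 0.375, 0.375, 0.125]
print(E)      # [1.0, 3.0, 3.0, 1.0]
print(F)      # [1, 4, 2, 1]
```

Note that the theta_i sum to 1, so the E_i sum to n, just as the F_i do.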
So, we are going to look at the difference F_i - E_i, take its square, normalize by the expected frequency, and then sum over all the possible classes: Q = sum from i = 1 to k of (F_i - E_i)^2 / E_i. In this way the metric captures how different the expected frequencies and the observed frequencies are. Naturally, once we have this, if the difference is too large, it is an indication that the observed samples are not following the null hypothesis; on the other hand, if the summed difference is small, it is an indication that the observed samples are following the null hypothesis distribution. Based on this statistic one can make a decision whether to accept or reject the null hypothesis. As we did earlier, we can set up a threshold: for a given threshold z_alpha, we accept the null hypothesis if the statistic is less than or equal to z_alpha, and we reject H_0 if it is larger than z_alpha. Now the question is: can we quantify how good or bad our accept/reject decision is? We may want to compute the probability that we reject, computed under the null hypothesis. If we can say that this probability is less than or equal to some number, then that gives the significance level of the test with that number. But then, how do we compute this probability; do we know the distribution of Q? Earlier, when we talked about hypothesis testing, we enforced the distribution of the statistic, assuming it follows a Gaussian distribution, or, to compute its distribution, we needed to know the distribution of the underlying samples.
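The statistic and the threshold rule above can be sketched as follows (the observed and expected frequencies and the threshold are illustrative assumptions, not data from the lecture):

```python
# Illustrative data: k = 4 classes, a uniform null hypothesis, n = 100.
F = [18, 30, 32, 20]           # observed frequencies (hypothetical)
E = [25.0, 25.0, 25.0, 25.0]   # expected frequencies under H0

# chi-square goodness-of-fit statistic Q = sum_i (F_i - E_i)^2 / E_i
Q = sum((f - e) ** 2 / e for f, e in zip(F, E))
print(round(Q, 2))  # 5.92

# accept/reject rule against a threshold z_alpha; 7.815 is the standard
# tabulated 0.95 quantile of chi-square with k - 1 = 3 degrees of freedom
z_alpha = 7.815
decision = "accept H0" if Q <= z_alpha else "reject H0"
print(decision)  # accept H0
```

Here Q stays below the threshold, so at the 5% level the observed frequencies are consistent with the hypothesized uniform distribution.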
But now, in this case, I am only looking at the empirical values F_i; see that Q is still a random quantity, because these F_i's are random. Will I need to know the underlying distribution of my samples to compute the distribution of Q, or can I say something without knowing it? It so happens that in this case we do not need to know the distribution of the underlying samples; in fact, one can argue that Q is approximately chi-square distributed with k - 1 degrees of freedom. We will discuss more about why this is the case. But notice: without requiring the underlying distribution of my samples, I can argue that Q approximately follows a chi-square distribution with k - 1 degrees of freedom. Once I have this, I should be able to quantify the significance level of my test by setting my threshold appropriately. In particular, an approximate alpha-level test is obtained by rejecting the null hypothesis when Q is larger than the (1 - alpha) quantile of the chi-square distribution; that is, to get an alpha-level test, we need to set z_alpha such that P(X > z_alpha) = alpha, where X is chi-square distributed with k - 1 degrees of freedom. And since we know the chi-square distributions well, we can compute their tail probabilities and tabulate them; for a given value of alpha we already know how to select z_alpha so that my test is an alpha-level test. So, good: what we have argued is that, without knowing the distribution of the underlying samples, we can say that my statistic has a chi-square distribution, and we can use readily available tables to compute the significance level of my test, or to set up my threshold so that the test achieves a given significance level.
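Since the chi-square quantiles are tabulated, selecting z_alpha for a given alpha reduces to a table lookup. A small sketch (the helper function is mine, not the lecture's; the quantiles are standard table values for alpha = 0.05):

```python
# Standard tabulated 0.95 quantiles of the chi-square distribution,
# indexed by degrees of freedom (so P(X > value) = 0.05).
CHI2_95 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}

def z_alpha_for(k, table=CHI2_95):
    """Hypothetical helper: threshold for an approximate 0.05-level
    chi-square goodness-of-fit test with k classes (fully specified H0),
    i.e. k - 1 degrees of freedom."""
    return table[k - 1]

print(z_alpha_for(4))  # 7.815  (k = 4 classes -> 3 degrees of freedom)
print(z_alpha_for(6))  # 11.07  (k = 6 classes -> 5 degrees of freedom)
```

In practice one would compute the quantile directly (for instance with a statistics library's chi-square inverse-cdf) rather than hardcoding a table, but the table makes the alpha-level construction transparent.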
Now, this test, or rather the approximation that Q follows a chi-square distribution with k - 1 degrees of freedom, works well when the expected frequencies are more than 5, that is, when E_i > 5 for all the classes. This need not hold all the time, but it is a rule of thumb which can be used when you want the approximation to be good. For example, we said E_i = n theta_i, where theta_i, the probability that the random variable takes value i, is known under the null hypothesis; if n is such that it makes every E_i larger than 5, then this is a very good approximation. Now, composite distributions: as I said, the null hypothesis may not always be completely specified; only the shape may be given, in which case we call the distribution composite. How do we go about it in this case? Let theta_i be the probabilities of the classes; now, because the underlying parameters of the distribution are unknown, we do not know these theta_i exactly. What do I mean by this? Suppose, for example, X is Poisson distributed; we know that the probability that X equals i, which in this case is theta_i, equals e^(-lambda) lambda^i / i!. So to compute theta_i I need the knowledge of lambda, but what if I do not know it? In this case, you may estimate lambda itself from the data: given your samples, you can estimate lambda by the empirical mean of the data, and we know that this is a good estimator, unbiased, consistent, and so on. Once we have this, we can plug this lambda_hat into the expression here and get the estimates theta_hat_i.
Now, once I have these, I can also get the estimated expected frequencies, which are now n theta_hat_i. All I need to do is compute my statistic as before, Q = sum over i of (F_i - n theta_hat_i)^2 / (n theta_hat_i); the only thing is that E_i has been replaced by n theta_hat_i here. Now, what about the distribution of Q? Does it follow the same distribution we had earlier, a chi-square distribution with k - 1 degrees of freedom? It so happens that when the parameters you estimate are maximum likelihood estimators, then one can indeed argue that Q still approximately follows a chi-square distribution. The only thing is that the degrees of freedom are now k - 1 - s. What is this s here? Earlier it was k - 1; now it is k - 1 - s, where s is the number of distribution parameters we estimated. For example, here in the Poisson case we estimated one parameter. If you have to apply this method there, then note that even though there are infinitely many classes in the Poisson distribution, since X can take values from 0 to infinity, in the samples we observe we may only see some k possible classes; say the observations fall into only k classes. In that case, from k - 1 one more is subtracted because I have estimated one more parameter, so the degrees of freedom become k - 2 in this case. So let us stop here, and then let us continue with a rough sketch of why Q indeed has a chi-square distribution.
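The composite Poisson case above can be sketched end to end (the sample is an illustrative assumption; note that with such a tiny sample the E_i fall below the rule-of-thumb value 5, so this only illustrates the mechanics, not a trustworthy approximation):

```python
from math import exp, factorial

# Hypothetical sample, assumed Poisson with unknown lambda (composite H0).
sample = [0, 1, 1, 2, 0, 3, 1, 2, 1, 0, 2, 1]
n = len(sample)

# Estimate lambda by the empirical mean (the MLE for Poisson data).
lam_hat = sum(sample) / n

# Classes actually observed: 0, 1, 2, 3 -> k = 4 classes.
classes = sorted(set(sample))
k = len(classes)

# Estimated class probabilities theta_hat_i = e^{-lam} lam^i / i!,
# estimated expected frequencies n * theta_hat_i, observed frequencies F_i.
theta_hat = [exp(-lam_hat) * lam_hat**i / factorial(i) for i in classes]
E = [n * t for t in theta_hat]
F = [sum(1 for x in sample if x == i) for i in classes]

Q = sum((f - e) ** 2 / e for f, e in zip(F, E))

# One parameter estimated (s = 1), so df = k - 1 - s = k - 2.
df = k - 1 - 1
print(df)  # 2
```

The only structural change from the fully specified case is that theta_hat_i replaces the known theta_i, and the chi-square comparison is made with k - 2 instead of k - 1 degrees of freedom.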