So far we have looked at the Kolmogorov-Smirnov test, which we argued is a distribution-free test because the distribution of the test statistic does not depend on the distribution specified in the null hypothesis. Let us recap what we said by looking at two things. First, the chi-square test, where we used the statistic Q = sum_{i=1}^{k} (F_i - E_i)^2 / E_i, where E_i is the expected frequency of class i under the null hypothesis and the F_i are the observed frequencies from the data. Computing the E_i required us to specify the distribution completely; when we did not know it fully, we first estimated its parameters, and from the fitted distribution we computed the expected frequencies of the classes. The good thing is that we were able to show that Q approximately follows a chi-square distribution: when the distribution is completely specified and we have k classes, Q is asymptotically chi-square with k - 1 degrees of freedom, and each estimated parameter reduces the degrees of freedom by one more. Perhaps we should have subscripted the statistic with k rather than n, because it is based on only k comparisons, one corresponding to each class; this was for the discrete (binned) case. In the KS test, by contrast, we have the statistic D_n = sup_x |S_n(x) - F_0(x)| based on the n samples directly, and its distribution is the same irrespective of the null hypothesis distribution F_0(x). Also recall that in the chi-square case we could explicitly establish the distribution of our statistic.
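As a quick illustration of the chi-square statistic just recalled, here is a minimal sketch in Python. The class counts are made up for illustration, and the 0.95 critical point for 3 degrees of freedom (about 7.815) is the value one would normally read off a chi-square table.

```python
# Hypothetical observed class counts F_i and expected counts E_i under H0
observed = [18, 25, 30, 27]          # F_i, made-up data
expected = [25.0, 25.0, 25.0, 25.0]  # E_i = n * p_i under a uniform H0

# Q = sum_i (F_i - E_i)^2 / E_i
Q = sum((f - e) ** 2 / e for f, e in zip(observed, expected))

# k = 4 classes, fully specified H0 -> compare against chi-square
# with k - 1 = 3 degrees of freedom; the 0.95 critical point is ~7.815
reject = Q > 7.815
```

With these counts Q is well below the threshold, so the uniform null hypothesis would not be rejected at the 5% level.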
For D_n, all we could argue is that its distribution does not depend on F_0; we did not obtain the distribution explicitly, but it can be computed, and tables of the critical values d_{n,alpha} are available, from which we obtain the alpha critical points. Again, this was for continuous distributions. Notice that the KS test required us to specify completely the distribution being tested. If only the shape of the distribution is given and not all of its parameters, one has to estimate those parameters and then use them in F_0(x). However, once you plug the estimates into the hypothesized distribution, the computation of the distribution of D_n becomes complicated, and because of that it is not easy to build tables for it. In fact, if you pretend that the estimated parameter values are the true values and continue to use the standard D_n tables, the resulting test can be very conservative, and you may end up making more errors in accepting or rejecting your distributions. To overcome this issue, at least for the Gaussian case, we have another test, the Lilliefors test (for short I will just call it the L test), which applies when the null hypothesis to be tested is Gaussian. Here all that is specified is that the null distribution is Gaussian; we have not been told the mean and variance. So how do we check whether the data samples follow a Gaussian shape? That is where Lilliefors' contribution comes into the picture.
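To make the D_n statistic concrete before moving on, here is a small sketch (the function name and sample values are mine). It uses the standard fact that the supremum |S_n(x) - F_0(x)| is attained at a jump of the empirical CDF, so only the order statistics need to be examined.

```python
def ks_statistic(data, F0):
    """D_n = sup_x |S_n(x) - F0(x)|, evaluated at the jumps of S_n."""
    xs = sorted(data)          # order statistics x_(1) <= ... <= x_(n)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        u = F0(x)
        # S_n jumps from (i - 1)/n to i/n at x_(i)
        d = max(d, i / n - u, u - (i - 1) / n)
    return d

# Check a tiny made-up sample against H0: Uniform(0, 1), whose CDF is x
d = ks_statistic([0.2, 0.5, 0.9], lambda x: x)
```

Note that F0 here must be fully specified, which is exactly the limitation the Lilliefors test addresses.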
So the suggested modification is this. In the KS statistic D_n we compared S_n(x) with F_0(x) over all points x. Instead of F_0, we now use the standard normal distribution, which we earlier denoted Phi(z). Before using it, we transform the data: instead of the raw x_i we work with z_i = (x_i - x_bar) / s, where x_bar is simply the empirical mean of the observed samples and s^2 = (1/(n-1)) sum_{i=1}^{n} (x_i - x_bar)^2 is the empirical variance. Notice that, as we did earlier, I am using the denominator n - 1 to ensure the variance estimator is properly normalized, and s, its square root, is the empirical standard deviation. The z_i are centered and normalized, so roughly they should behave like samples from a normal distribution with mean 0 and variance 1, and we compare their empirical CDF with the standard normal CDF. Everything else remains the same. Again, the distribution of this statistic cannot be computed in explicit form, but it can be obtained through empirical evaluation. The good thing is that even after this transformation the test is still distribution-free: we do not need to know the underlying mean and variance to evaluate the distribution of D_n. Lilliefors did extensive computations of the tables for this D_n, so the d_{n,alpha} tables for the transformed statistic are available, and we can carry out the test using the same criterion we had earlier.
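A minimal sketch of the transformed statistic, assuming the setup above (the function name is mine; Phi is computed from the error function, Phi(z) = (1 + erf(z / sqrt(2))) / 2):

```python
import math
import statistics

def lilliefors_statistic(data):
    """KS-type distance between the standardized sample and Phi."""
    xbar = statistics.fmean(data)   # empirical mean x_bar
    s = statistics.stdev(data)      # empirical std dev, n - 1 denominator
    phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    zs = sorted((x - xbar) / s for x in data)  # standardized order stats
    n = len(zs)
    d = 0.0
    for i, z in enumerate(zs, start=1):
        u = phi(z)
        d = max(d, i / n - u, u - (i - 1) / n)
    return d

d = lilliefors_statistic([1.0, 2.0, 3.0, 4.0, 5.0])  # made-up sample
```

The resulting d must then be compared against the Lilliefors tables, not the ordinary KS tables, precisely because the mean and variance were estimated from the same data.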
So, when I want to check whether the distribution of the data follows a Gaussian distribution, we apply this transformation, compute the statistic, and use the tables that give the critical thresholds for the various significance levels; then we compare the computed D_n value against the threshold. If D_n > d_{n,alpha}, we reject the hypothesis that the data follow a Gaussian distribution; if it is less than or equal to this value, we accept it. (Let me check whether we said strictly greater or greater than or equal; I believe we made it strictly greater for rejection, and accept with less than or equal.) So this is the summary of the Lilliefors test. For an example, one can follow the same steps as in the KS test; the only changes are that we apply this transformation first, use the standard normal distribution in the definition of D_n, and use the d_{n,alpha} values computed for this specific statistic. In particular, after you do the transformation and use the standard normal distribution, do not use the tables of the plain KS statistic; use the separate tables available for the Lilliefors test, and based on whether D_n exceeds the threshold or not, reject or accept the hypothesis that your data follow a Gaussian distribution. So far we have studied several methods for testing whether data follow certain distributions, and when the family of distributions is fixed and we are only unsure about the parameters, we also studied various hypothesis tests for that.
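The decision rule described above can be sketched in two lines. Both numbers here are illustrative: the statistic value is the kind of output the earlier computation might produce for n = 5, and the critical value is approximately what published Lilliefors tables give for alpha = 0.05 at n = 5; consult an actual table in practice.

```python
d_n = 0.137     # illustrative statistic from a standardized sample, n = 5
d_crit = 0.337  # approximate alpha = 0.05 Lilliefors critical point, n = 5
                # (illustrative; read the real value from published tables)

# Strict inequality for rejection, <= for acceptance, as in the lecture
decision = "reject H0" if d_n > d_crit else "accept H0"
```

Here the statistic is well below the threshold, so the Gaussian hypothesis is accepted at this level.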
So, we studied various parametric and non-parametric methods, but often, before getting into them, one can do a simple visualization of the data itself and see whether it appears to follow the hypothesized distribution. There are various methods for this; I am going to quickly discuss two of them, under the heading of exploratory data analysis. Suppose you are given two CDFs, F and G, and you want to check how close, similar, or dissimilar they are. For this we will study two things: one is called the probability-probability plot, often called the PP plot, and the other is called the quantile-quantile plot, called the QQ plot. So this is exploratory data analysis for goodness of fit. Let us first look at the probability-probability plot. For every possible value x you can evaluate F(x); we know that the range of F(x) lies between 0 and 1, and so does that of G(x). On the x-axis we take all possible values of F(x), and the y-axis represents all possible values of G(x), so the whole plot lives inside the unit square with corners at 0 and 1. Inside this square, draw the diagonal line with a slope of 45 degrees.
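Since a PP plot is just the set of points (F(x), G(x)), a sketch without any plotting library can already quantify how far the curve strays from the 45-degree diagonal. The two CDFs below are my own illustrative choices, deliberately different from each other:

```python
# Two fully specified CDFs on [0, 1], chosen for illustration:
F = lambda x: x        # Uniform(0, 1)
G = lambda x: x * x    # Beta(2, 1), a genuinely different distribution

grid = [i / 1000 for i in range(1001)]
pp_points = [(F(x), G(x)) for x in grid]   # the points of the PP plot

# Vertical distance from the 45-degree line y = x
max_dev = max(abs(gx - fx) for fx, gx in pp_points)
```

For these two CDFs the maximum deviation is 1/4 (attained at x = 1/2), a clearly visible bow away from the diagonal, signalling that the distributions differ.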
Now you can plot G(x) versus F(x) for all possible values of x; if F and G are completely specified, you will get some continuous curve. Clearly, if your plot of G versus F lies very close to the 45-degree line, you can claim that the two distributions are similar; and if it is too far off from the 45-degree line, that is an indication that the two distributions may be different, and you should explore the data more, or at least do further systematic analysis, before declaring that they are very different. Similarly, instead of the PP plot one can also do a QQ plot. What does the QQ plot do? Instead of simply plotting F against G, we plot F^{-1}(p) against G^{-1}(p), where p ranges between 0 and 1; notice that the range of F^{-1}(p) and G^{-1}(p) can now be the entire real line. Again we do the same thing: for each particular value of p between 0 and 1 you get one point, with coordinates F^{-1}(p) and G^{-1}(p), and you trace out these points for all possible values of p ranging between 0 and 1.
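For the QQ plot of two fully specified distributions, the Python standard library's NormalDist class provides the inverse CDFs directly. The example below, a sketch with distributions of my choosing, compares N(0, 1) against N(0, 4): the QQ curve is a straight line of slope 2, clearly off the 45-degree diagonal.

```python
import statistics

F = statistics.NormalDist(mu=0.0, sigma=1.0)   # reference distribution
G = statistics.NormalDist(mu=0.0, sigma=2.0)   # same shape, twice the spread

ps = [i / 100 for i in range(1, 100)]          # p strictly inside (0, 1)
qq = [(F.inv_cdf(p), G.inv_cdf(p)) for p in ps]

# Here G^{-1}(p) = 2 * F^{-1}(p), so the QQ points lie on the line y = 2x
# rather than y = x: the distributions are related but not equal.
```

A slope other than 1 on a normal QQ plot is the classic signature of a scale mismatch, while curvature indicates a shape mismatch.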
So, for every possible value of p you get a point, and if F^{-1} is the same as G^{-1}, you again get a 45-degree line passing through the origin; if they are not the same, the curve will not be close to or overlap the 45-degree line. Based on how close or far your curve is from the 45-degree line, you get a sense of whether the two distributions are the same or not. Using this, we can now think of comparing what we have: S_n(x), which is computed from the data, and F_0(x), which we can compute from the null hypothesis. In this case the QQ plot is often easier to produce than the PP plot because of the properties of S_n(x). First of all, S_n(x) takes only the discrete values 0, 1/n, 2/n, and so on up to 1. Because of this, for the QQ plot I need S_n^{-1} and F_0^{-1} only at these discrete points, that is, at 1/n, 2/n, ..., n/n. Moreover, given the data x_1, x_2, ..., x_n, once we order them to obtain the order statistics, we know that S_n^{-1}(i/n) is exactly x_(i); you know this already, it is a property of the empirical distribution. So suppose, in plotting your QQ plot, you make the x-axis correspond to your empirical distribution.
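A sketch of the empirical QQ computation just described, with a made-up sample. The last point i = n is dropped here because F_0^{-1}(1) is infinite, which is exactly the issue taken up next:

```python
import statistics

data = [0.3, -1.2, 0.8, 1.9, -0.4, 0.1]   # hypothetical sample
xs = sorted(data)                          # order statistics x_(1..n)
n = len(xs)
F0 = statistics.NormalDist()               # null hypothesis: standard normal

# S_n^{-1}(i/n) = x_(i), so the QQ points are (x_(i), F0^{-1}(i/n));
# i = n is skipped because F0^{-1}(1) = +infinity
qq = [(xs[i - 1], F0.inv_cdf(i / n)) for i in range(1, n)]
```

If the sample really were standard normal, these points would hug the 45-degree line through the origin.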
So the points on the x-axis are simply x_(1), x_(2), ..., x_(i), and so on, with x_(i) corresponding to level i/n; and the corresponding points on the y-axis are F_0^{-1}(1/n), F_0^{-1}(2/n), ..., F_0^{-1}(i/n), and so on. All you need to do is plot these pairs, draw the curve through them, and see how close it is to the 45-degree line. So when you compare the QQ plot of your empirical distribution with that of the null hypothesis, you know exactly which points to take on the x-axis and which on the y-axis; you know your curve, and all you need to check is whether it has a 45-degree slope, or lies very close to it. If it does, that is a good indication that your data follow the null hypothesis distribution; otherwise, you need to do a more confirmatory test like the ones we did before. One small issue with this: the last order statistic x_(n) corresponds to the point F_0^{-1}(n/n), and we know that F_0^{-1}(1) is infinity. Because of this, as we exhaust all n points and reach the last point of the order statistics, the theoretical quantile F_0^{-1} tends to infinity.
To overcome this issue, software packages often do something slightly different. What we are plotting is the pairs (x_(i), F_0^{-1}(i/n)) for i = 1, 2, ..., n, and when i = n the y-coordinate is at infinity. Most packages therefore use a slightly modified version: instead of i/n they use a plotting position such as (i - 0.375) / (n + 0.25), so that when i reaches n you are not hitting infinity on the y-axis and you can still observe a finite value there. This is just one example of such a correction; various other combinations are used so that every point can still be shown on the plot. Even though we discussed this exploratory data analysis at the end, it is perhaps the first thing you want to do: draw your QQ plots and PP plots and see how close they are to the 45-degree line. If they are pretty close, you gain good confidence that your data are as per your null hypothesis; otherwise, maybe the visual evidence is not enough, and you need to do further tests. That is where you can revisit all the hypothesis testing we did: parametric methods, when your distributions are parameterized and you want to use full knowledge of the distribution, or non-parametric methods, when you do not want to use knowledge of the distribution to make the test. With this we will conclude. Thank you very much.
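The correction above can be sketched as follows, reusing a made-up sample. The formula (i - 0.375) / (n + 0.25) from the lecture is the one known as Blom's plotting position; other packages use slightly different constants.

```python
import statistics

data = [0.3, -1.2, 0.8, 1.9, -0.4, 0.1]   # hypothetical sample
xs = sorted(data)                          # order statistics x_(1..n)
n = len(xs)
F0 = statistics.NormalDist()               # null hypothesis: standard normal

# Plotting positions (i - 0.375) / (n + 0.25) stay strictly inside (0, 1),
# so all n theoretical quantiles are finite, including the i = n point
qq = [(F0.inv_cdf((i - 0.375) / (n + 0.25)), xs[i - 1])
      for i in range(1, n + 1)]
```

Unlike the plain i/n version, every one of the n points now lands at a finite position on the plot, so the largest observation is no longer lost.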