Today we are going to cover what are called probability plots. I believe you have all experienced them, played with them, and experimented with them during the R sessions with Professor Guru Rajan. Let me technically introduce what these plots are and what different kinds of plots are available.

So first let us review what we have done in the past. We introduced a random variable as a real-valued function with a probability space as its domain. Then we introduced expectations of this random variable. We also introduced certain special random variables which have specific models of distribution. We introduced two kinds. One is the discrete distributions, in which we covered Bernoulli trials, the binomial distribution, the geometric distribution, the negative binomial distribution, and the hypergeometric distribution. Among the continuous distributions, we introduced the uniform distribution, then the normal distribution, then some derivatives of the normal distribution such as the chi-square, the t-distribution, and the F-distribution. We also introduced other distributions such as the log-normal, Weibull, and exponential distributions.

I must mention that there are many, many distributions available, and many new distributions are being developed in order to meet today's data requirements. But we have shown you a few of them which you come across more frequently and which will be useful for your immediate engineering and materials data requirements. Of course, as and when we move forward, any new distributions that we require in further analysis will be introduced in detail there.

Now, in all of this, a question arises. We say that a random sample has been drawn from a specific distribution. How do we know that it really comes from that distribution? The real picture is something like this. You have a very large population which you are trying to study.
For example, if you are studying the yield strength of a particular alloy which is produced in a factory, how are you going to guarantee that it has a particular yield strength? Well, the method is that you take a few random samples, chosen randomly and not systematically, and from them you derive certain statistics. You would have assumed that, theoretically, the yield strength should follow, say, a log-normal distribution or a normal distribution. Then the question that remains is: does the sample say that or not?

This is the question we would like to answer in this session. It can be answered through what is known in statistics as goodness-of-fit tests. Goodness-of-fit tests are a theoretical way of comparing the data CDF with the assumed CDF. But there is another approach, the graphical method, which does not give you a strong proof but gives you a confirmatory guideline that your assumption, that this particular sample comes from this distribution, may be correct. This is what we would like to cover. We are not going to cover the theoretical goodness-of-fit tests in this course, but we will cover some of the graphical methods for confirmatory guidelines.

We are going to consider primarily two such graphical methods. One is called the probability plot, or P-P plot, and the other is the quantile plot, which is known as the Q-Q plot. How do we go about doing this graphical comparison? We assume that the data you have is a sample of values coming from a particular assumed distribution.
Hence, naturally, the cumulative distribution function that you obtain from the data should match the cumulative distribution function of the assumed distribution. Similarly, if the sample data truly comes from the assumed distribution, then the quantiles calculated from the data should match the theoretical quantiles of the assumed distribution. Such matching can be checked in two ways. If you plot the data CDF versus the distribution CDF and both are equal, the points should fall on a straight line; this is case one. In the other, if you plot the data quantiles against the theoretical quantiles of the assumed distribution, they should also match and fall on the straight line x = y; this is case two.

So here I have described the P-P plot in detail. A CDF is calculated from the data, and it is called the empirical CDF. Another CDF is calculated from the assumed distribution, and it is called the theoretical CDF. The P-P plot refers to plotting the empirical CDF on the y-axis and the theoretical CDF on the x-axis. If our assumption is true, that the data truly comes from the assumed distribution, then the points on this plot should fall approximately on the line y = x; otherwise we should be able to clearly see a mismatch.

So let us see the two plots. In this plot, I have simulated standard normal variates using a random number generator. I have simulated about 100 of them, and from that data I have calculated the cumulative distribution function. Let us do some recalling here. The cumulative distribution function of the data at t, let us call it F(t), is nothing but the number of data points less than or equal to t divided by n, the total number of data points. This is called the empirical CDF.
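The matching experiment just described can be sketched in a few lines. This is only an illustrative Python sketch, not the code behind the lecture plots: the seed, the helper name `empirical_cdf`, and the choice to evaluate both CDFs at each data point are assumptions. The theoretical CDF used here is that of the standard normal distribution.

```python
import random
from statistics import NormalDist

def empirical_cdf(data, t):
    """Fraction of data points less than or equal to t."""
    return sum(1 for x in data if x <= t) / len(data)

random.seed(0)  # fixed seed (an assumption) so the sketch is repeatable
data = [random.gauss(0.0, 1.0) for _ in range(100)]  # 100 standard normal variates

std_normal = NormalDist(mu=0.0, sigma=1.0)

# One P-P point per observation: (theoretical CDF on x, empirical CDF on y).
pp_points = [(std_normal.cdf(t), empirical_cdf(data, t)) for t in sorted(data)]

# Largest vertical distance of any point from the line y = x.
max_gap = max(abs(emp - theo) for theo, emp in pp_points)
print(f"largest distance from y = x: {max_gap:.3f}")
```

If the assumed distribution is right, the printed distance should be small, and the points in `pp_points` should scatter on both sides of the line without a pattern.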
So you calculate the empirical CDF. Then, from the standard normal distribution, you have a CDF, which you can again call F(t). It is nothing but the integral from minus infinity to t of (1 / sqrt(2 pi)) exp(-x^2 / 2) dx. This is called the theoretical CDF. The values of this theoretical CDF at t are given on the x-axis, and the empirical CDF is plotted on the y-axis. So if you take any typical point here, this is its theoretical CDF value and this is its empirical CDF value. This is how all these points are plotted.

Now what we say is that if your assumed distribution is the correct distribution for the data, that is, your data truly comes from the standard normal distribution, which in this case we know because we have generated it, then the points should fall approximately on the line y = x. You can see that these points do, and therefore the plot confirms graphically that the data points seem to be coming from the standard normal distribution.

Let us take the case of a mismatch, because sometimes we understand a little when we see the matching, but we understand more when we see the mismatch. Here I have generated log-normal data with a random number generator, and I am assuming that the data is actually coming from a Weibull distribution. In other words, I have data from a population whose distribution is actually log-normal, while I believe that I have drawn a sample from a Weibull distribution.
I have very purposefully taken these two distributions together. In the field of metallurgical properties such as strength, whether you take yield strength or UTS, or even fracture toughness, there is always a question of which distribution is closer, and the two competing distributions for the strength properties of a material are the log-normal and the Weibull. So here I have taken the log-normal and Weibull distributions as two competing distributions.

So, once again: we have taken data actually from the log-normal distribution, but I have assumed, or believed, that the data is coming from the Weibull distribution. Then the Weibull becomes my theoretical distribution and the data gives my empirical distribution. In the plot, this axis shows the Weibull CDF and this one shows the empirical CDF, which I have labeled log-normal. It is only an empirical CDF; it is not really a log-normal CDF. Let us understand this clearly: empirical means that it is coming from data. But for our understanding in this course I have generated this data from a log-normal distribution, and therefore I am writing log-normal; otherwise it is just data. Just as in the previous case, the empirical CDF is the ratio of the number of data points less than or equal to t to the total number of data points.
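The mismatch experiment can be sketched the same way. This is a rough Python illustration only: the lecture does not state which log-normal or Weibull parameters were used, so the values below (`mu = 0`, `sigma = 1`, `SHAPE = 1.5`, `SCALE = 1.0`), the seed, and the helper names are all hypothetical.

```python
import math
import random

def empirical_cdf(data, t):
    """Fraction of data points less than or equal to t."""
    return sum(1 for x in data if x <= t) / len(data)

def weibull_cdf(t, shape, scale):
    """Weibull CDF: 1 - exp(-(t / scale) ** shape) for t > 0, else 0."""
    return 1.0 - math.exp(-((t / scale) ** shape)) if t > 0 else 0.0

random.seed(1)  # fixed seed (an assumption) so the sketch is repeatable
# The data truly comes from a log-normal population (hypothetical mu=0, sigma=1).
data = [random.lognormvariate(0.0, 1.0) for _ in range(100)]

# The distribution we wrongly believe in; these parameters are hypothetical.
SHAPE, SCALE = 1.5, 1.0

# P-P points: (assumed Weibull CDF on x, empirical CDF on y).
pp_points = [(weibull_cdf(t, SHAPE, SCALE), empirical_cdf(data, t))
             for t in sorted(data)]

below = sum(1 for theo, emp in pp_points if emp < theo)
above = sum(1 for theo, emp in pp_points if emp > theo)
max_gap = max(abs(emp - theo) for theo, emp in pp_points)
print(f"points below the line: {below}, above: {above}, largest gap: {max_gap:.3f}")
```

With a wrong assumed distribution, one would expect long runs of points on one side of the line y = x rather than a random scatter around it. The exact pattern depends on the parameters chosen.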
So this is your empirical CDF at the value t, and this is calculated from the Weibull distribution function, and I call it the Weibull CDF. Now if I draw the line y = x, you see that the data is systematically falling above and then falling below. There is no random behavior; it is not an error difference, as it was in the previous case. The data systematically goes above and then systematically goes below, and therefore we understand that there is a mismatch between your empirical CDF and your assumed CDF. Your assumed CDF is far away from what your data says, and remember, data is what we believe.

Now let us understand the Q-Q plot. Here again, as with the CDF, we take the quantiles of the empirical distribution and plot them against the quantiles of the theoretical distribution. Please recall what a quantile is. You have understood quantiles in terms of quartiles. The first quartile Q1 is such that the probability of X less than or equal to Q1 is 0.25; the second quartile Q2 is such that the probability of X less than or equal to Q2 is one half; and the third quartile Q3 is such that the probability of X less than or equal to Q3 is 0.75. So if you want to divide the data into four parts such that the probabilities from negative infinity to Q1, Q1 to Q2, Q2 to Q3, and Q3 onwards are all equal, each one fourth, these points are what are called quartiles. And if you recall, Q2 is also known as the median.

Quantile is the general term. Suppose you divide your data into equal-probability parts using five values, which I call p1, p2, p3, p4, and p5. There are six parts here, one, two, three, four, five, and six, so each one has probability one sixth. Then p1 is such that the probability that X is less than or equal to p1 is 1/6, p2 is such that the probability that X is less than or equal to p2 is 2/6, or one third, and likewise for the rest. Remember that it is a cumulative probability that I am defining, so I am counting all the values below the point, and therefore for p2 it becomes 2 times 1/6. Since five points divide the data into six equal parts, these are hexiles, not pentiles. So if you divide the data into four equal probabilities you have quartiles; into six, hexiles; you can have deciles, you can have percentiles, and so on. In this way you can define quantiles.

In the example I am going to show, with a match and a mismatch, I am going to consider the deciles. That means the data will be divided into 10 equal-probability parts, where each part has a probability of 1/10, or 0.1.

So let us see the plots now. Again I have taken the same standard normal variates, generated using a random number generator, and I have calculated their deciles. There are one, two, three, four, five, six, seven, eight, and nine points. Remember, with deciles you are dividing into 10 parts, so there will be nine quantiles, nine deciles. These nine deciles I have plotted. If you take any typical point, this shows the data decile, or as I have called it the empirical quantile, and this is the theoretical quantile calculated from the normal (0, 1) distribution. I have plotted each value, and you can see that there is some deviation. I have made the points very big, but you can look at the centers of these points and see that they are a little above the line or a little below it, because I am using a random number generator to generate the standard normal values.

Now look at the mismatch. Again I have done the same thing: I have used a random number generator to generate log-normal random variables, calculated their quantiles, and I am thinking that they have all come from a Weibull distribution, and I am comparing them. So I am doing a bit of an artificial thing, but this is to drive the point home. Once again these are the empirical deciles and these are the Weibull deciles. If you look, there are exactly nine data points: three and three is six, and these three make nine. You can see that very systematically they divert away from the line x = y, and therefore the plot says that your data is not coming from the distribution you assumed; it is coming from some distribution other than the assumed Weibull.

Once again, I repeat that here we are comparing the empirical deciles with the Weibull deciles. The empirical ones are calculated from data. If I call d1 the first decile, then d1 is such that the probability of X less than or equal to d1 is 0.1; d2 is such that the probability that X is less than or equal to d2 is 0.2; and likewise. The interesting part is that in the previous case we were matching probability with probability; here we are matching data value with data value. So this is something strikingly different in this case.
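The decile comparison for the matching case can be sketched as follows. This is a minimal Python illustration under stated assumptions: the seed is arbitrary, and the empirical deciles use the default interpolation rule of the standard library's `statistics.quantiles`, which may differ slightly from the rule used for the lecture plots or by R.

```python
import random
from statistics import NormalDist, quantiles

random.seed(2)  # fixed seed (an assumption) so the sketch is repeatable
data = [random.gauss(0.0, 1.0) for _ in range(100)]

# Nine cut points that split the sample into ten equal-probability parts.
empirical_deciles = quantiles(data, n=10)

# Theoretical deciles: d_k satisfies P(X <= d_k) = k / 10 under N(0, 1).
std_normal = NormalDist()
theoretical_deciles = [std_normal.inv_cdf(k / 10) for k in range(1, 10)]

# Q-Q points: (theoretical decile on x, empirical decile on y).
qq_points = list(zip(theoretical_deciles, empirical_deciles))
for theo, emp in qq_points:
    print(f"theoretical {theo:+.3f}   empirical {emp:+.3f}")
```

Note that both coordinates are now values of the variable, not probabilities. For the mismatch case, the theoretical deciles would come from inverting the assumed Weibull CDF in place of `inv_cdf`.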
There are other plots; the same plots are made in a different way. For example, instead of taking both the x-axis and the y-axis as quantiles or as CDFs, you take the x-axis as the data values and the y-axis as a theoretical probability scale. When computers were not so common, when we studied statistics, there used to be normal probability paper and Weibull probability paper available in the market. Now you do not need it. As you have done in the R exercise, you can very easily give a command specifying what theoretical probability scale you want on the y-axis, and it plots the data values against the theoretical probability values. Again the matching has to be on the line x = y. So x is in the usual numeric scale showing the data, the probability values are plotted against the values that the random variable takes, and here also a match falls on the line y = x. So the exercises that you might have done in the descriptive statistics R sessions largely use the probability scale as the y-axis and the data as the x-axis.

So let us summarize what we discussed today. We talked about graphical methods to check whether the distributional assumptions made on the data match or do not match. If you plot the empirical CDF against the theoretical CDF, it is called a P-P plot. If you plot the data quantiles against the theoretical quantiles, it is called a Q-Q plot. The same comparison can be made by plotting the empirical CDF on probability paper. I do not know if the market sells this probability paper any more, but you can have it easily in any software package that does statistical analysis; in particular, R has this facility. In all the above cases, it is a match if the points fall on the line x = y; if they do not, that indicates a mismatch between your assumption and where the data comes from. Thank you.