Welcome to dealing with materials data. In this course we learn about the collection, analysis and interpretation of data. We are in the second module, on descriptive statistics using R, and we have been looking at how to deal with distributions while presenting experimental results. Specifically, we have been looking at some grain size distributions. This is a case where the steel consists of two phases, the grain size data of both phases is available in a CSV file, and it is clear that the mean and standard deviation alone are not sufficient to describe the data. So we need to find the distributions that fit the given data, and that is what we want to try in this session. Of course, we have not yet covered probability distributions; that is the next module. So some of the ideas we use here we will revisit after the session on probability distributions. For the moment we will just use an existing library on this data, look at the fit, and identify what fits the given data better. We use the fitdistrplus library. It is used to identify the best distribution for fitting the data, and it also estimates the parameters of the distribution that fits. What we give in this session is a very tutorial introduction: I am not going to explain many things, you just give a command, see the results, learn what distribution it is, and give another command to fit that distribution. But we will revisit and discuss some of the details after we go through the probability distribution module. To understand how this works it is important to know about skewness and kurtosis, and you might already have been taught about them.
Skewness tells how long the tail is: the data is said to be positively skewed if it has a long tail on the right, and negatively skewed if the tail is on the left. A normal distribution, on the other hand, has tails on both sides, and if the data is very nicely normally distributed it is also symmetric about the mean, with the same type of tail on either side. But if the data is skewed, positively or negatively, you will see a longer tail either on the right or on the left. Kurtosis is also information about the tail; it specifically talks about outliers, and it tells you how heavy the tail of the given data is as compared to a normal distribution. By looking at these two quantities it is possible to find out which probability distribution could best fit the data, and that is what we are going to do. To understand these quantities a little better, recall the moments about the mean. The k-th moment about the mean, mu_k, is defined as mu_k = sum over i of (x_i - mu)^k f(x_i): you take each data value, subtract the mean (which is the first moment about the origin), raise the difference to the power k, and weight by the probability. We will discuss this in detail in the next module, but for now it is enough to understand that f(x_i) gives the probability that the random variable takes the value x_i, and we are assuming that these values come from that distribution. And mu, the first moment about the origin, is the mean, while sigma squared, the variance, is the second moment about the mean.
So the sum of (x_i - mu)^2 f(x_i) is sigma squared, the variance, and skewness and kurtosis are based on the third and fourth moments about the mean. But it is not just putting 3 and 4 in the exponent: you also normalize the resulting quantity, dividing by sigma cubed to get the skewness and by sigma to the power 4 to get the kurtosis, where sigma is the standard deviation, the square root of the variance. These two numbers, skewness and kurtosis, are what we are going to use to understand which probability distribution describes our data properly; specifically, we are going to look at the grain size data and understand how it is distributed. You can see that our grain size data for phase 1 and phase 2 has large skewness and also large kurtosis. The distribution is one-sided, with a long tail to the left in both phases, so in our convention it is negatively skewed, and you can also see the fatness or thickness of the tail: compared to a normal distribution these tails are much heavier. So that is the qualitative picture, but we are going to get numbers for these two quantities, defined in terms of moments about the mean and appropriately normalized, and for that we use the fitdistrplus library. While we analyse the data and come up with a fit for it, we also have to evaluate how good the fit is, and for that there are measures: you will see that R reports the log-likelihood, the AIC (Akaike Information Criterion) and the BIC (Bayesian Information Criterion).
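The skewness and kurtosis defined above can be computed directly in base R. This is a minimal sketch with helper names of my own choosing; the sample moments replace the probability weights f(x_i), each observation getting weight 1/n.

```r
# Sample skewness and kurtosis from moments about the mean:
# skewness = mu_3 / sigma^3, kurtosis = mu_4 / sigma^4
skewness <- function(x) {
  mu <- mean(x)
  s  <- sqrt(mean((x - mu)^2))   # sigma: root of the second moment about the mean
  mean((x - mu)^3) / s^3         # third moment, normalized by sigma cubed
}
kurtosis <- function(x) {
  mu <- mean(x)
  s2 <- mean((x - mu)^2)         # variance (second moment about the mean)
  mean((x - mu)^4) / s2^2        # fourth moment, normalized by sigma^4
}

# A symmetric, normal-like sample has skewness near 0 and kurtosis near 3
set.seed(1)
x <- rnorm(1e5)
skewness(x)   # close to 0
kurtosis(x)   # close to 3
```

A long left tail drives the skewness negative, which is what we will see for the grain size data.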
Of course we will come back to understand these quantities better after we learn about probability distributions, inference and so on; for now just pay attention to what is returned when you try to do the fitting. So let us go and do the fitting as usual, starting with getting the data. We start R, which here is version 3.6.1, check the working directory to make sure we are in the right place, and then invoke the library, fitdistrplus. Then we read the data: the CSV file "grain size dataset 2.csv" is read, and we find the rows with phase identity 1 and 2 and save those row numbers in i1 and i2. Pulling out from x all the i1 rows gives the phase 1 data, which we call x1, and the i2 rows give the phase 2 data, x2. We have already seen this data: there are 3664 observations of 6 variables, of which about 457 are for phase 1 and the remaining roughly 3200 for phase 2. Now let us use the command descdist. What is descdist? If we look it up, it gives the description of the empirical distribution of non-censored data. There is a difference between censored and non-censored data: if, for some reason, to save time or because you are not able to continue the experiment for longer, you arbitrarily stop the experiment at some time or beyond some particular value, that is called censored data. What we have is not censored data.
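The loading and subsetting steps described above can be sketched as follows. The column names here are assumptions (the session does not spell them out), and a small synthetic data frame stands in for read.csv("grain size dataset 2.csv") so the snippet is self-contained.

```r
# Stand-in for the CSV: in the session, x <- read.csv("grain size dataset 2.csv")
# The column names Phase and GrainSize are assumed for illustration.
x <- data.frame(
  Phase     = c(1, 2, 2, 1, 2),
  GrainSize = c(24.1, 23.0, 21.5, 24.3, 22.8)
)

i1 <- which(x$Phase == 1)   # row numbers for phase 1
i2 <- which(x$Phase == 2)   # row numbers for phase 2

x1 <- x[i1, ]               # phase 1 observations
x2 <- x[i2, ]               # phase 2 observations

nrow(x1)   # number of phase 1 rows
nrow(x2)   # number of phase 2 rows
```

With the real file, the same subsetting yields the roughly 457 phase 1 and 3200 phase 2 observations mentioned above.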
So descdist gives a description of the distribution of the empirical data, which should be non-censored; in that case you can use it. If you apply it to the phase 1 grain size data, it marks several theoretical distributions, all of which we will learn about in the next module: normal, uniform, exponential, logistic, beta, lognormal and gamma. It also shows that the Weibull distribution is close to the gamma and lognormal, which appear as dotted lines, so the Weibull is close to those two distributions. And where is our observation? It lies in the band for the beta distribution. This graph is called the Cullen and Frey graph, and it is a plot of the square of the skewness versus the kurtosis. For example, the normal distribution sits at kurtosis 3 and squared skewness close to 0, so if some data falls near that point it must be close to normally distributed, and so on. Because our data falls in the beta regime, the data is probably best described by a beta distribution. descdist also prints the summary we have already seen: the minimum value is 20.8 and the maximum is 24.3, the median is 24.3, because we saw lots of data points at 24.3, the mean is 24.1 and the standard deviation is 0.4, so it is 24.1 plus or minus 0.4; and in addition it has now estimated the skewness and the kurtosis. Let us do the same for the phase 2 grain size data. This has a much larger range: the minimum is 11.9 and the maximum is 24.3, the median is again 24.3, so the median is the same, and the mean, 23.4, is also quite close, with a standard deviation of about 2.
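The coordinate that the Cullen and Frey graph plots, (skewness squared, kurtosis), can be computed by hand to see why different distributions land in different regions. This sketch compares a simulated normal sample with a simulated lognormal one; the data are synthetic, not the grain size data.

```r
# Position of a sample on the Cullen and Frey plane: (skewness^2, kurtosis)
coord <- function(x) {
  mu   <- mean(x)
  s    <- sqrt(mean((x - mu)^2))
  skew <- mean((x - mu)^3) / s^3
  kurt <- mean((x - mu)^4) / s^4
  c(skew2 = skew^2, kurtosis = kurt)
}

set.seed(42)
coord(rnorm(1e5))    # near (0, 3): the point marked "normal" on the graph
coord(rlnorm(1e5))   # large skew^2 and kurtosis: far from the normal point
```

descdist does this for the empirical data and overlays the theoretical curves, which is how it locates our data in the beta band.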
So phase 2 is 23.4 plus or minus 2 and phase 1 is 24.1 plus or minus 0.4; in terms of mean and standard deviation the two data sets look almost the same. The skewness values are also quite close, minus 3.1 versus minus 2.9, not very different, but the kurtosis, about 16 versus about 11.4, shows some difference. So phase 2 is obviously not the same as phase 1, but it is also in the beta regime, just slightly different from the previous one. We can also plot them both in the same figure, which makes them easy to compare: the kurtosis values are about 16 and about 11-something, and the squared skewness is about 10 for one and a little less than 9 for the other, but in both cases the points fall in the band for the beta distribution. So let us go back and try to fit the distribution, now that we know it is beta. For fitting to a beta distribution you will learn that the values have to be between 0 and 1, so we normalize: x is the x1 values divided by their maximum, and y is x2 divided by its maximum. So we have two normalized data sets; let us try to fit x to a beta distribution. We call fitdist, give it the data x and the distribution "beta", and save the fit as fit.b1. If we try this, we get the message that the function MLE failed. What is MLE? MLE is maximum likelihood estimation, and if it fails we can try other fitting methods. Let us use MME, moment matching estimation.
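Moment matching for the beta distribution has a closed form, so the estimate that method = "mme" produces can be sketched in base R: match the sample mean m and variance v to the beta mean and variance. A plausible reason the MLE failed, incidentally, is that dividing by the maximum puts at least one observation exactly at 1, where the beta density is zero (for shape2 > 1) and the log-likelihood diverges; moment matching never evaluates the density, so it is unaffected. The data here are simulated with known parameters, since the grain size file is not reproduced.

```r
# Method-of-moments estimates for a beta distribution.
# With m = mean and v = variance, the standard closed form is
#   common = m*(1-m)/v - 1;  shape1 = m*common;  shape2 = (1-m)*common
beta_mme <- function(x) {
  m <- mean(x)
  v <- mean((x - m)^2)
  common <- m * (1 - m) / v - 1
  c(shape1 = m * common, shape2 = (1 - m) * common)
}

# Sanity check on simulated data with known parameters
set.seed(7)
z <- rbeta(1e5, shape1 = 5, shape2 = 2)
beta_mme(z)   # recovers approximately shape1 = 5, shape2 = 2
```

fitdist with method = "mme" does essentially this matching for us and packages the result with diagnostics.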
How do we know about these methods? You can use help(fitdist), for example, and you will get this information: fit of univariate distributions to non-censored data by maximum likelihood (MLE), moment matching (MME), quantile matching (QME) or maximizing goodness-of-fit estimation (MGE). So let us try MME; for that you say method = "mme". That fit works, and you can get information about it: fitting of the distribution beta by matching moments, these are the parameters, and it reports the log-likelihood, AIC and BIC, all as infinity. This is what I said: we want to understand what these quantities are, but we will come back to them after a few more modules, when we learn about inference. Having done the fitting, you can of course plot it. You can see the data with the density plot here and the CDF plot here; the red line running through is our fit. You can also see the Q-Q plot, which seems to fit well, and the P-P plot. What these Q-Q and P-P plots are we will learn when we look at the distributions, but for now this seems to fit well. So we can do the same exercise for the second data set y, for phase 2. We again see that MLE fails, so we again set the method to MME, and that works. Looking at the fit, you again get the log-likelihood, AIC and BIC as infinity; we will come back to understand why, but for now we can plot and see. Again the empirical and theoretical densities, the empirical and theoretical cumulative distribution functions, the Q-Q plot and the P-P plot are all okay. Compared to the previous case the Q-Q plot is slightly off, but it is still okay; it fits most of the data, and that is what we are realizing.
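The idea behind the Q-Q plot that fitdistrplus draws can be sketched by hand: compare empirical quantiles of the data with theoretical quantiles of the fitted distribution; a good fit puts the points on the 45-degree line. Data and parameters here are simulated and assumed, standing in for the normalized grain size data and its fitted beta parameters.

```r
# Hand-rolled Q-Q comparison against a beta distribution.
set.seed(11)
z <- rbeta(2000, shape1 = 5, shape2 = 2)   # stand-in "data"

p  <- ppoints(length(z))                      # probability grid
qe <- quantile(z, probs = p, names = FALSE)   # empirical quantiles
qt <- qbeta(p, shape1 = 5, shape2 = 2)        # theoretical quantiles at the fit

# For a good fit the pairs (qt, qe) hug the identity line:
# plot(qt, qe); abline(0, 1) would show this graphically
cor(qe, qt)          # very close to 1
mean(abs(qe - qt))   # small average deviation
```

The P-P plot is the same comparison made on cumulative probabilities rather than quantiles.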
Now that we have done this exercise, recall that we have also been looking at the electrical conductivity data of ETP copper, and we noticed that that data looked like a normal distribution. Is it so? Can we check that it is indeed normal? To do that, let us read the ETP copper conductivity data and run descdist on it, and we find that our observation lies right on the star marking the normal distribution. This is what we have been noticing: the minimum is 101.1, the maximum is 101.5, the median is 101.3, the mean is 101.32 and the standard deviation is 0.1. These we have already seen, and now the skewness is quite close to 0 and the kurtosis is quite close to 3, which shows that this is a very nice normal distribution. Of course we can check that it indeed is so; how do we do that? We fit it to a normal distribution: we call fitdist with the data and the distribution "norm", and ask for the summary of the fit. It fits, using the maximum likelihood method, and reports the mean with its standard error and the standard deviation, which is about 0.1. This time you can see that the log-likelihood, AIC and BIC are not infinities; it gives you numbers, and it also gives what is known as the correlation matrix, which we will look at at some point. Let us plot the normal fit we have made: the empirical and theoretical densities match, the cumulative distribution functions match, and the Q-Q plot is a nice straight line, as is the P-P plot. So in this case everything nicely follows the normal distribution.
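For the normal distribution the maximum likelihood estimates have a closed form, so the numbers the fit summary reports can be reproduced in base R. This is a sketch: the data are simulated with the mean and standard deviation quoted above, since the conductivity file itself is not reproduced here.

```r
# Reproducing, by hand, what a normal MLE fit reports:
# mean, sigma (n-denominator), log-likelihood, AIC and BIC.
set.seed(3)
x <- rnorm(500, mean = 101.32, sd = 0.1)   # stand-in conductivity data

n      <- length(x)
mu_hat <- mean(x)                          # MLE of the mean
sd_hat <- sqrt(mean((x - mu_hat)^2))       # MLE of sigma (divides by n, not n-1)

loglik <- sum(dnorm(x, mu_hat, sd_hat, log = TRUE))
aic    <- 2 * 2 - 2 * loglik               # k = 2 estimated parameters
bic    <- log(n) * 2 - 2 * loglik

c(mean = mu_hat, sd = sd_hat, logLik = loglik, AIC = aic, BIC = bic)
```

These are finite numbers, unlike the beta case above, because every observation sits where the normal density is strictly positive.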
So, to summarize: we have been looking at data, and sometimes we find that the data is better described by distributions. In the case of conductivity, repeated measurements give values about some mean, and the spread is there because of random noise; that is why it is a normal distribution. On the other hand, every single measurement of the microstructure gives a whole distribution of grain sizes, and that is obviously not a normal distribution or bell-shaped curve. To describe these kinds of distributions you can use the fitdistrplus library, and the general methodology is that by looking at where the skewness and kurtosis values lie, we decide which theoretical distribution will best fit the given empirical data. That is the exercise we have done, and we will come back to some aspects of this fitting exercise after we go through the probability distributions. Thank you.