Welcome to Dealing with Materials Data, where we look at the collection, analysis and interpretation of data from materials science and engineering. We are in module 4 on data processing, and we are looking at how to go from the data to the underlying distribution. We already know how to estimate properties of a given data set or data series: we can calculate the average, the mean squared deviation from the average, and the root mean squared deviation. But we know that the data is a sample from an underlying probability distribution, so we want to get, from the averages of the data, the properties of the probability distribution. How to go from the quantities that we calculate on the data to the properties of the probability distribution is what we are going to discuss in this session. We are going to assume that the data consists of independent random samples. Then the best estimate of the mean of the distribution is simply the average of the data. The best estimate of the variance of the distribution is slightly larger than the mean squared deviation from the average of the data: if you have the MSD, you multiply by n and divide by n minus 1, and you get the best estimate for the variance of the distribution. For large n, you can see that n and n minus 1 are not going to be too different. And the standard deviation of the distribution, of course, is the square root of the variance. But if this assumption of independent data is not true, then the variance can be even larger than what you estimate from this formula. These estimates of the mean and variance of the distribution from the average and spread of the data are called point estimates, because we are calculating just one number. For example, we calculate the average of the data and we say that this is the best estimate for the mean of the distribution.
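The point estimates described above can be sketched in a few lines of Python (the lecture itself works in R; this is only an illustrative translation, and the function name is mine):

```python
def mean_and_variance_estimates(data):
    """Point estimates from a data series: the sample average, the mean
    squared deviation (MSD), and the Bessel-corrected estimate of the
    distribution variance, MSD * n / (n - 1)."""
    n = len(data)
    avg = sum(data) / n                              # best estimate of the mean
    msd = sum((x - avg) ** 2 for x in data) / n      # mean squared deviation
    var_hat = msd * n / (n - 1)                      # best estimate of the variance
    return avg, msd, var_hat

# small made-up data set just to show the n / (n - 1) correction
avg, msd, var_hat = mean_and_variance_estimates([1.0, 2.0, 3.0, 4.0])
# avg = 2.5, msd = 1.25, var_hat = 1.25 * 4 / 3
```

For large n the correction factor n / (n - 1) approaches 1, as the lecture notes.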
We calculate the MSD, from which we can calculate the variance, and we say that this is the best estimate for the variance of the distribution. These kinds of estimates are point estimates, but sometimes we are also interested in giving interval estimates, which we will discuss in this session as we go along. Sometimes we want to know how accurate the mean that we have estimated is. If you have more and more data points, the average that you get from the data will of course be closer to the true mean of the distribution. The accuracy of the mean is related to the standard deviation, but it is not equal to the standard deviation. The average of the data, x bar, is itself a sample from a distribution. So if you generate lots of data sets and lots of such averages, that will actually help you recover the distribution from which the average itself is sampled. So it is possible to do a large number of experiments and get a better estimate for the mean from the averages of several data sets. Now, if we assume that the measurements we are making are all independent, the variance of the average x bar is given by the variance that you calculate from the data divided by the number of data points, and the standard deviation of the average is the square root of this: sigma hat by root n, where delta x squared is the average of the squares of the data minus the square of the average. This is the quantity we are incorporating here. This is true only when the statistical variations in the measurements are all independent; if they are not independent, the individual fluctuations do not add up like this and the error becomes larger. What happens if the data is correlated? Correlation should also be accounted for when we calculate the variance, and how to do that in a particular scenario is something we will discuss later in the case studies. Now let us go back to estimating the mean.
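The standard error of the mean for independent samples can be sketched as follows (Python rather than the lecture's R; the function name is illustrative):

```python
import math

def standard_error(data):
    """Standard deviation of the average x_bar for independent samples:
    sigma_hat / sqrt(n), with sigma_hat from the Bessel-corrected variance."""
    n = len(data)
    avg = sum(data) / n
    var_hat = sum((x - avg) ** 2 for x in data) / (n - 1)
    return math.sqrt(var_hat / n)

se = standard_error([1.0, 2.0, 3.0, 4.0])
```

The 1 / sqrt(n) factor is why more independent measurements tighten the estimate of the mean; for correlated data the fluctuations do not add up this way, as discussed above.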
If the measurements are samples from a normal distribution (so we are making here an assumption about the distribution from which the data is sampled), and since the estimated variance itself has a spread, we have to look at the quantity root n times (x bar minus mu) by sigma hat, that is, (x bar minus mu) divided by (sigma hat by root n). This follows a t distribution with nu degrees of freedom, and the degrees of freedom is the total number of observations minus one, n minus 1, because we have already calculated one quantity from the data, namely the average. Using the t distribution we can give a confidence interval. By that we mean we can say that with 50 percent probability the mean will lie in this range, or with 95 percent probability the mean will lie in that range, or with 99 percent probability, and so on. These kinds of estimates, where we are not giving one number for the mean but are saying in what range the mean will fall, are known as interval estimates. So you can either give point estimates, for example take the data, average it, and give that as the maximum likelihood estimate for the mean of the distribution, or you can say that the mean will lie in a given range with so much certainty, say a 90 percent probability that the mean lies only in this range. This is the t distribution, assuming normal data, because there is also the sigma which has a spread. But suppose sigma is known exactly and you do not have to estimate it from the data; then you can put sigma here, and you can see that the distribution is actually the standard normal. So it is also possible to estimate the intervals from the standard normal distribution: you can say that with 90 percent probability the true mean will lie in this range, and for that we have to use the standard normal distribution.
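As a sketch in Python (the lecture's code is in R; scipy's `t.interval` plays the role of qt there, and the data values below are made up purely for illustration):

```python
import numpy as np
from scipy import stats

# hypothetical stand-in for n independent, normally distributed measurements
data = np.array([101.30, 101.35, 101.28, 101.33, 101.31, 101.36])
n = len(data)
xbar = data.mean()
se = data.std(ddof=1) / np.sqrt(n)        # sigma_hat / sqrt(n)

# 95% confidence interval for the true mean: t with n - 1 degrees of freedom,
# since sigma_hat was estimated from the data itself
lo, hi = stats.t.interval(0.95, n - 1, loc=xbar, scale=se)
```

If sigma were known exactly, `stats.norm.interval(0.95, loc=xbar, scale=sigma / np.sqrt(n))` would be used instead, exactly as described above.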
So whether it is t or standard normal is determined by whether it is sigma hat or sigma: sigma hat meaning we estimated it from the data, sigma meaning we knew it and did not calculate it from the data. One can also ask questions about the accuracy of the variance. The statistics of the variance is the statistics of sums of squares of random variables, so it follows a chi-squared distribution. The relative standard deviation of the variance is the square root of 2 by (n minus 1), and the relative standard deviation of the standard deviation itself is 1 by the square root of 2 into (n minus 1); these results are obtained from the chi-squared distribution. So it is possible to estimate the accuracy of the variance fairly easily just from these numbers. For example, say we have the conductivity data that we are looking at, the mean of which is 101.3 and the standard deviation 0.1, from 20 measurements. That means the relative inaccuracy in that standard deviation of 0.1 is 1 by the square root of 38, which is about 0.16. So the actual number could be 0.1 plus or minus 0.016. If you include this error, we should report the number as 101.3 plus or minus 0.12, but because of the significance of the digits we are not going to go beyond one significant digit, so we will still report it as 0.1. If the significant digits were higher, or if this number were not 2 but something like 6 or 7, then the numbers would actually change. So one way of looking at the accuracy of the variance is to look at the relative accuracy and report the numbers accordingly. Now let us go to the data that we have, the conductivity data, look at the point and interval estimates, and also try to understand where these estimates come from. The first thing to note is that we are going to use the t distribution, because we are going to assume that the data is normally distributed. So let us plot the t distribution and see how it looks.
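These chi-squared results are easy to evaluate numerically; a small Python sketch (function names are mine), reproducing the n = 20 example from the lecture:

```python
import math

def rel_sd_of_variance(n):
    """Relative standard deviation of the sample variance for normal data:
    sqrt(2 / (n - 1)), from the chi-squared statistics of sums of squares."""
    return math.sqrt(2.0 / (n - 1))

def rel_sd_of_sd(n):
    """Relative standard deviation of the sample standard deviation:
    1 / sqrt(2 * (n - 1))."""
    return 1.0 / math.sqrt(2.0 * (n - 1))

# Lecture example: 20 measurements with standard deviation 0.1
n, s = 20, 0.1
rel = rel_sd_of_sd(n)      # 1 / sqrt(38), about 0.16
err = s * rel              # about 0.016, so report s as 0.1 +/- 0.016
```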
So this is the code; we are very familiar with this by now. We have a sequence from minus 3 to plus 3, and we plot this sequence against the probability density function of the t distribution with 19 degrees of freedom. That is what is plotted, and you can see that the t distribution looks like this. From this probability density function we know that the area under the curve is 1, which means the probability for the data to lie anywhere between minus infinity and plus infinity is 1. That is 100 percent probability, but that by itself is not very useful. Suppose instead you want to know, for example, in what range 95 percent of the data will lie. The distribution is symmetric about 0, so if you take off 0.025 on this side and 0.025 on that side, the remaining region gives you the range in which the data will fall 95 percent of the time. That is what we use to give the interval estimates. To understand it a little better, let us do the other plot. We are going to have two curves, both from minus 3 to plus 3: one is the t distribution with 19 degrees of freedom, the other is the standard normal distribution with zero mean and unit standard deviation. What we plot now is the cumulative distribution function, using pt and pnorm, and we draw two lines, one at 0.025 and the other at 0.975. So let us plot this. The first thing you see is that there is a small difference between the t distribution and the standard normal distribution. Let us zoom in and see. You can see that in the case of the t distribution, 2.5 percent of the data falls below this value, whereas in the case of the normal it is slightly greater than minus 2. Similarly, on the other side, 97.5 percent of the data falls from minus infinity up to this value, which is slightly greater than 2 for the t distribution but slightly smaller than 2 for the normal distribution.
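The quantiles being read off the plot can also be computed directly; a sketch with scipy standing in for R's qt and qnorm:

```python
from scipy import stats

# 2.5% and 97.5% quantiles: where the two tails of probability 0.025 are cut
t_lo, t_hi = stats.t.ppf([0.025, 0.975], 19)    # t with 19 degrees of freedom
z_lo, z_hi = stats.norm.ppf([0.025, 0.975])     # standard normal

# t cuts slightly further out (about +/- 2.09) than the normal (about +/- 1.96),
# because the t distribution has heavier tails
```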
In other words, in the t distribution, 95 percent of the data falls between these two lines, because the remaining 2.5 percent is in this tail and 2.5 percent is in that tail. Similarly, for the standard normal, 95 percent of the data falls in its range, with 2.5 percent falling here and 2.5 percent falling there. This is how we determine the confidence interval when we say, with 95 percent confidence, that the mean will lie in this range. This is the range that we calculate from the given data, and we report that range, saying that the true mean should lie within it; the true mean is somewhere, our data mean is somewhere, and there is a spread because we are sampling from the distribution. And depending on whether we know the standard deviation or not, we use either the normal or the t distribution. That is the idea behind determining the interval estimate. So we are looking at the conductivity data, and using the t distribution and the standard normal distribution it is possible to estimate confidence intervals for the true mean of the conductivity, that is, the range in which it will lie. To do that we use the same idea. If we are given some confidence level, say we want 90 percent probability that the mean lies in the range, or 95 percent as we have taken, that means there is a 2.5 percent probability in this tail and 2.5 percent in that tail. So alpha times 0.5 is the probability at which you should evaluate the quantile to find the lower point, and 1 minus alpha times 0.5 for the upper point, whether in the t distribution or in the standard normal distribution. If alpha is 0.05, then 100 into (1 minus alpha) is 95, so you can give a 95 percent confidence level. In order to do that, let us take the copper conductivity data first and do it here.
So you can see what this computation does: we read the data and calculate the average value. We know how many data points there are, and then we calculate the standard deviation from the data itself. Once we have the standard deviation, say we want to know the range in which the mean will lie with 50 percent probability, that is, if we have to be right 50 percent of the time. Then alpha is 0.5, and we multiply it by 0.5 because we want half of it on each side: 50 percent is in the middle, so there is 25 percent in the lower tail and 25 percent in the upper tail. That is why the multiplication by 0.5 is there. And we use the t distribution with n minus 1 degrees of freedom, because we have calculated the average, so one degree of freedom is gone, and the t distribution because we are using a standard deviation estimated from the data itself. So you will see the a and b values: the mean will lie between 101.3045 and 101.3355, and that is with 50 percent probability. Of course you can make the probability higher by giving other values. Suppose you want 90 percent probability; then alpha is 0.1, so there is 0.05 on either side, and the 0.5 multiplying 0.1 takes care of that. And if you want higher confidence, that the mean should fall in the interval not just 50 percent but 90 percent of the time, then obviously the end points are going to become wider apart. That is what you see: 101.28 to 101.35, compared with the previous values where the digits after 101.3 were 30 and 33; from 30 and 33 to 28 and 35, the range has expanded. And you can do it for other values also.
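The qt-based computation can be reproduced from the summary values quoted in the lecture (the raw data file is not shown here, so x bar of about 101.32, s of about 0.1 and n = 20 are assumed):

```python
import math
from scipy import stats

n, xbar, s = 20, 101.32, 0.1        # assumed summary of the conductivity data
se = s / math.sqrt(n)

def t_interval(conf):
    """Two-sided confidence interval for the mean, t with n - 1 degrees of
    freedom (Python analogue of the qt() calls in the lecture's R session)."""
    half = stats.t.ppf(1.0 - (1.0 - conf) / 2.0, n - 1) * se
    return xbar - half, xbar + half

ci50 = t_interval(0.50)   # roughly (101.305, 101.335)
ci90 = t_interval(0.90)   # roughly (101.281, 101.359), wider as expected
```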
Suppose you take alpha as 0.05; that is the value we calculated here, the 95 percent confidence interval, and it will be between 101.27 and 101.367. And you can of course go still further: suppose you want the 99 percent confidence interval, and you will find that it is between 101.2557 and 101.3843. This is all assuming the t distribution, but you can also calculate assuming a normal distribution, in which case we do not need all these calculations: we simply assume that s is known, say 0.1. We have the mean from the data, and the standard deviation is assumed to be known, and in that case we should not use qt but qnorm, the standard normal distribution with 0 mean and 1 standard deviation. You can see that the 99 percent interval is now 101.26 to 101.37, so the t interval is slightly larger than the normal one: in the last digits, 38 has become 37 and 25 has become 26. So the normal distribution gives a somewhat shorter interval, not by much, but a bit shorter. You can do the same thing for the normal with 95 percent or 90 percent; as you go from 95 to 90, the digits 27 and 36 become 28 and 35. And if you are okay with a 50 percent confidence interval, say alpha equal to 0.5, then you will be between 101.3 and 101.33. In this way we can estimate the interval in which the mean will lie at any given level of confidence. To summarize, we have looked at getting estimates for the probability distribution from the data. There are two kinds you can get. Point estimates, the mean and standard deviation, can be obtained from the average and spread of the data. In addition, if you assume that you know the distribution from which the data is coming, you can also give confidence levels for the value you are estimating: you can say that with so much probability the true mean will lie in a given range.
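For the known-sigma case, the same computation with the analogue of qnorm, continuing with the assumed summary values from before:

```python
import math
from scipy import stats

n, xbar, sigma = 20, 101.32, 0.1    # sigma assumed known exactly here
se = sigma / math.sqrt(n)

def norm_interval(conf):
    """Confidence interval using standard normal quantiles (qnorm in R)."""
    half = stats.norm.ppf(1.0 - (1.0 - conf) / 2.0) * se
    return xbar - half, xbar + half

lo_n, hi_n = norm_interval(0.99)            # roughly (101.262, 101.378)
half_t = stats.t.ppf(0.995, n - 1) * se
lo_t, hi_t = xbar - half_t, xbar + half_t   # roughly (101.256, 101.384)
# the normal interval sits strictly inside the wider t interval
```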
Specifically, we have looked at the cases of the standard normal and the t distribution: standard normal when the variance is known, t when the variance itself is also calculated from the data. Finally, there is also a way to estimate the relative error in the standard deviation, which is useful when we are reporting the numbers, because typically we report a value as mu plus or minus the standard deviation, and if there are errors in the standard deviation, which we know from the data, we should accommodate those as well when reporting the value; we have seen one example of this. We have used this copper conductivity data throughout and done all these calculations to see how point and interval estimation works. We are going to continue with robust estimation, where we do not want to assume anything about the underlying distribution of the data; those methods are rank based, and there can also be bootstrapping methods. We will discuss that in a different session of this module. Thank you.