Welcome to Dealing with Materials Data. In this course we are looking at the collection, analysis and interpretation of data from materials science and engineering. We are in the module on data processing, and in this session we want to look at the distribution function of a data series. You have a set of numbers, some data that is available to you, and you want to say something about the distribution function of that data. There are two things we are going to do. First we will take the ETP copper conductivity data, and the first step with data like that is to plot the histogram. The histogram gives you an approximate idea of the probability distribution function that is sampled by the data. In the case of the copper conductivity we know that it looks normal. That is not surprising: it is the same material measured more than once, so the errors are random, and random errors or noise do give you a normal distribution. But if you have a very small number of observations, a histogram can be noisy and it might be difficult to judge the probability distribution. In those cases it is better to display the cumulative distribution function. So we are going to do this exercise for the ETP copper. But there are sometimes complications: it is not uncommon to lack access to the raw data. In this case, for example, I have the data for all 20 measurements, but typically that is not what is published; it is very rare for people to list out all the measurements they make. What is worse, sometimes you might have access only to binned data, that is, data that has already been analyzed: values which lie within a range are simply counted, without telling you what the exact values were, and the result is given only in the form of a histogram plot. So I am assuming that even the binned data is not available to you in raw form.
You might have access only to the binned data, and only in the form of a plot. Can we do any analysis on that data? The answer is yes, but to do that you first have to generate the data: you go from the plot to the data, and then from the data you can go back and do the analysis. How do we do it? At this point a disclaimer is necessary: there are many different ways of achieving this, and different people have different preferences; what follows is based on my own preference and experience. I am going to give you an introduction to a couple of tools, an introduction in the sense that I will just show how it is done. One is called Engauge Digitizer, which is used to read data from an image. The other is called GIMP, which is used to pull a figure out of a PDF file into a JPEG file in a form that you can then feed to the digitizer. You can read the data from Engauge Digitizer into LibreOffice directly, or you can enter it by hand. In the exercise that I am going to show, the data was entered by me by hand, but I will also show that the digitizer can pick up points automatically, although we are not going to use that feature much. So in this session we are also going to look at how to use these two tools to generate data from a PDF file containing binned data; you can of course use them for anything of this type. Specifically, we are going to consider the supplementary material to the paper "Surface diffusion driven nanoshell formation by controlled sintering of mesoporous nanoparticle aggregates" by Anumol et al., published in Nanoscale. Figure 14a there gives the cluster size distribution of as-synthesized titania in the form of a histogram, and that is the data we are going to read out; we have used the digitizer and GIMP to do this.
I will show you the data and we will do the analysis on it, but I will also show you how to use GIMP and Engauge, taking figure 14b as an example. That will be part of this session. Now, when you have binned data like that, the data points carry different statistical weights. Why is that? In the case of the ETP copper conductivity data, each point had equal weight 1/n, where n was the number of measurements, because every data point was simply one measurement. But when you have binned data, the statistical weights of the different bins are different. For example, in the titania case the cluster size data was given in bins of width 20 nanometers. We do not have access to the original data, so a bin at, say, 200 nanometers means that all clusters with sizes from 190 to 210 nanometers were put into that bin. We therefore give a statistical weight w_i to each bin, where w_i is the frequency in that bin divided by the total number of observations: w_i = f_i / N, the number of observations in the i-th bin over the total number of observations. This is how different statistical weights arise, and it becomes important for such binned data. As I said, it is not at all uncommon to see binned data being published, so if you want to do analysis with existing data from the literature, this kind of exercise becomes essential. Then we plot the cumulative distribution. For the conductivity data it is straightforward: the ecdf function will do it, and we have already done it once, but we will repeat it for the sake of completeness. For binned data you have to build the cumulative distribution from the weighting factors: F(x) = Σ_i w_i I(x_i ≤ x), where I is an indicator function that takes the value 1 when its condition is true and 0 when it is false.
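As a minimal sketch in R, the weights can be computed like this; the frequencies here are made up purely for illustration and are not the paper's counts:

```r
# Statistical weight of each bin: w_i = f_i / N, where f_i is the count in
# bin i and N is the total number of observations across all bins.
f <- c(5, 12, 20, 8, 3)   # hypothetical bin counts, for illustration only
N <- sum(f)               # total number of observations
w <- f / N                # statistical weights; they sum to 1 by construction
w
```

The weights always sum to 1, which is what lets them serve as probabilities in the cumulative distribution later.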
The argument of I is a condition: if the condition is satisfied, I takes the value 1; if it is not satisfied, it takes 0. We will take the cluster size data and calculate the cumulative distribution function this way, because the cumulative distribution function at x is the probability that a value is less than or equal to x. Then we will plot it on the normal probability scale; the plot will be a straight line if the data is normally distributed. That is how we identify the underlying distribution. You can look at the cumulative distribution function and make out fairly easily what type of data you have. You can also get it from the histogram if the data is good, but otherwise it is only an approximation, because noise can throw you off, and there are cases where it is very difficult to tell distributions apart from a histogram; for example, lognormal and Weibull might be very hard to distinguish. The cumulative distribution function is a slightly better way of understanding the distribution from which the data comes. So let us go and do this exercise. First we take the copper conductivity data and do the analysis. Then we want to see how to take a specific figure from a PDF file, cut it out, and generate a JPEG from it in such a way that it can be fed to the digitizer; in the digitizer we then read off the values, which can be entered into LibreOffice to produce a CSV file for further analysis. I will do that as the second exercise, to show you the digitization and the reading of numbers from such figures, and finally we will take one such set of binned data and do the analysis on it in this session. As usual, the first thing is to start R; it is a good idea to check the working directory so that we are in the right place, and the R version here is 3.6.1.
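That definition translates directly into R, because a logical comparison already behaves as the indicator function; the bin centres and frequencies below are invented purely for illustration:

```r
# Weighted empirical CDF for binned data: F(t) = sum_i w_i * I(x_i <= t).
x <- c(160, 180, 200, 220)           # hypothetical bin centres (nm)
f <- c(2, 7, 10, 1)                  # hypothetical frequencies
w <- f / sum(f)                      # statistical weights
Fw <- function(t) sum(w * (x <= t))  # (x <= t) is the indicator: TRUE -> 1
Fw(200)                              # fraction of clusters with size <= 200 nm: 0.95
```

With these numbers, 19 of the 20 observations lie at or below 200 nm, so F(200) = 0.95, and F at the largest bin is 1.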
The first exercise is to read the data and plot the histogram, so let us do that. We read the ETP conductivity data and plot the histogram. This is a frequency versus value plot, and it does look like a normal distribution, with perhaps a slight skew; so this gives you an idea that it might be normal. That is the first exercise, and the second, of course, is to plot the cumulative distribution function. The cumulative distribution function also indicates that this could be a normal distribution, though it is difficult to be sure, so we would like the y axis to be scaled as a normal probability scale; then you will find that it is indeed normal. We have done this exercise once in the past. Now let us go to the second exercise and consider the supplementary data. This is the supplementary information to the paper on surface diffusion driven nanoshell formation by controlled sintering of mesoporous nanoparticle aggregates by Anumol et al., published in Nanoscale. The paper has lots of data, and the data we are interested in is here: cluster size in nanometers versus frequency. The figure caption says it is a histogram showing the cluster size distribution of as-synthesized titania aggregates, and there is also a cluster size distribution for titania annealed at 600 degrees Celsius. This is the data I have taken and from which I have generated the raw data for our further analysis, and I want to show how I did that, using this as the example. The first thing we need to do is to take this figure and generate a JPEG file out of it, so that it can then be fed into the digitizer.
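The first exercise can be sketched in a few lines of R. I do not reproduce the course's actual data file here, so simulated measurements stand in for the 20 conductivity readings; with a real file you would replace the `rnorm` line with something like `read.csv(...)` on your own data:

```r
# Simulated stand-in for the 20 ETP copper conductivity measurements
# (placeholder only -- substitute your own read.csv(...) column here).
set.seed(1)
cond <- rnorm(20, mean = 101, sd = 0.5)

hist(cond, main = "ETP copper conductivity", xlab = "Conductivity")  # exercise 1
plot(ecdf(cond), main = "Empirical CDF")                             # exercise 2
```

With only 20 points the histogram is noisy, which is exactly why the ECDF view that follows is the more reliable diagnostic.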
To do that I am going to open this file using the application GIMP, the GNU Image Manipulation Program. It asks which page I should open; I want only this page, so I select it. GIMP has lots of tools, and the one relevant to us is the selection tool, so I choose rectangle select. What am I going to select? You click and drag with the mouse to select the region, then copy or cut that portion and make a new file. You need to give the size of the new file, and the size you can read off from the figure itself: this is 250, this is 500, this is 750, so we probably have about 300 or 350 in width, and I will say 350. In the height, likewise, this is 250 and this is 500, so again probably about 350; I make it 350 by 350 and then say paste. You can see that the figure I cut has been pasted here. I do not need any rotation; I just want to make the image a little bigger so it is easier to work with, so let us zoom it to about 400. Now I am going to save this figure: let us go to the data directory and save it as test.xcf, so that you can do further work on it later; but we actually want the figure to be in JPEG format, so we export the image at as good a quality as possible. The figure has been exported, and if you go here you can see the file test.jpg. Now we are going to use Engauge Digitizer: I say File, Import, and the file I need to import is the one I just generated, test.jpg in the data directory, so I open it.
The data has been imported, and the first thing we need to do is to identify the x and y axes so that the program can read the data. You need to identify three points: the origin, some point along x whose value you specify, and some point along y whose value you specify, so that distances on the plot are mapped for the program and it knows how far along the image corresponds to how much in x or y units. So let us do that. Let us zoom in, and say this is the first point we want to mark: it is 100 in x and 0 in y. This point is 350 and 0, so let us mark it as x = 350, y = 0. Then we mark this one, which is 0 on the x axis and 17 on the y axis. Very good. Now Engauge Digitizer knows the scale: if I go here, it knows that this is 250, and as you can see, if I hover the mouse over these points it gives those numbers. For example, if I want to know what this point is, it tells me it is about 205, and this one is about 175. Once we have defined the three points, that is all that is needed; the curve point tool is the one that actually identifies points on the curve.
For example, let me just start putting points: this point, this point, this point. The program picks up the points, and you can then export them: if you say Export and give test.csv, it saves the data, and you can open test.csv and it gives you the x-y points. But this is not very useful for me here, because the program traces the curve strangely; the curve should actually go like this. So what I did instead was to hover and read off the data: this is 175, and for a cluster size of 175 the frequency is about 5; this is 115 and the frequency turned out to be about 2; this one reads about 205; and so on. In this fashion you can read off any plotted data, and it comes in very handy: if you see data in the literature in a form you cannot analyze, you can use these programs, and these are not the only ones; other tools are also available. From what I read off I made a data set, so let me copy it: this is the CSV file with cluster size and frequency. The sizes go from 160 to 320 in steps of 20, that is 160, 180, 200, 220, 240, 260 and so on up to 320, with the frequencies read from the figure. This is the data that I digitized using GIMP and Engauge, and it is shown here: cluster size in nanometers versus the frequency of such clusters. This is the data we are going to use now for our further analysis.
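Once read off by hand, the values can be put into a small data frame and saved as a CSV for the analysis that follows. The sizes below match the 160-to-320-in-steps-of-20 grid described above, but the frequencies are placeholders for illustration, not the counts actually read from the figure:

```r
# Digitized histogram as a data frame: cluster size (nm) and bin frequency.
d <- data.frame(x = seq(160, 320, by = 20),        # bin centres, 20 nm apart
                f = c(1, 2, 5, 9, 7, 4, 2, 1, 1))  # hypothetical frequencies
write.csv(d, "cluster.csv", row.names = FALSE)     # for later analysis in R
```

Keeping the digitized values in a plain CSV means the rest of the analysis can be rerun any time without touching GIMP or the digitizer again.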
Let us see what analysis we are trying to do. First you read the cluster size and frequency data. The sum of frequencies is just the sum of that column, and the weight is the number of observations in a bin divided by the total number of observations. The number of bins n is given by the length of x, because remember the data has just x and f as its two columns. Then we make a vector z with n entries and a sequence that goes from 2 to n: z_1 is taken as w_1, and each subsequent z_i is the previous value plus w_i, so z is the running sum of the weights. This gives the cumulative distribution, and you can normalize by the total so that it goes to 1; that is why everything is divided by the last value z_n. When plotting, we should remember that each x value was read from a bin with a spread of 20, so we want the cumulative distribution function to step with exactly that width. You do not want the value to jump at the bin centre itself; you want the step to begin 10 before and run 10 after, and this shifting is done to make sure the step has the right size. I am also adding a point exactly at the centre of each step to indicate where we read the value; the step itself indicates the uncertainty in that value, since anything that lies within that range is binned to that point. So let us plot this and see. This is what I said: 160 is where we have a data point, but 160 actually means 150 to 170, so we want a step, and that is the reason for the minus 10. It draws the line and shows a step, and the actual values we read from the histogram are the bins at 160, 180, 200 and so on.
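The steps just described can be sketched as follows; the frequencies are again illustrative placeholders rather than the values read from the paper:

```r
# Weighted cumulative distribution for binned data, with 20 nm wide steps.
x <- seq(160, 320, by = 20)              # bin centres, bin width 20 nm
f <- c(1, 2, 5, 9, 7, 4, 2, 1, 1)        # hypothetical frequencies
w <- f / sum(f)                          # statistical weights
n <- length(x)                           # number of bins

z <- numeric(n)
z[1] <- w[1]
for (i in 2:n) z[i] <- z[i - 1] + w[i]   # running sum of the weights
z <- z / z[n]                            # normalise so the CDF ends at 1

plot(x - 10, z, type = "s",              # shift by half the bin width: steps
     xlab = "Cluster size (nm)", ylab = "Cumulative fraction")
points(x, z, col = "red", pch = 16)      # red point at each bin centre
```

The `x - 10` shift makes each step span the full bin, and the red points mark where the values were actually read, just as in the lecture's plot.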
I have put a red point to indicate where we read the value, and the step width is the uncertainty, the bin size: all values that lie within this range are clubbed together at this value. That is the plot. Of course, we have already seen the data in histogram form, and you can also plot it here to see the histogram; in fact, if you set the plot type appropriately, you can see that the heights of these lines equal the frequencies, so it reproduces the nice histogram that was in the data itself. So we have read these values and plotted them here. The more useful thing, though, is to put the y axis on a probability scale, so let us do that. Let us look at the commands. It is the same procedure: sum the frequencies, make the weights, find the number of data points, generate the cumulative sum, normalize it, and plot. We have done all that already; the only extra step is to use ggplot now and scale the y axis with it. We have made a new data frame called y. What is y? It takes x as x, xx as x minus 10 (because, remember, we wanted to make those steps), and y as z divided by its last value, so that the values go to 1. We take this data frame and plot it with ggplot, with xx on the x axis and y on the y axis; the geometry is a step, so it draws the step plot; then we scale the y axis to the normal probability scale and add the points at the bin centres as in the previous case. You can see that this is more or less a straight line, indicating that this data might also be normal. To summarize: if we have raw data, we can deal with it directly.
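The lecture does this with ggplot2 by transforming the y axis to a normal-probability scale. An equivalent base-R sketch plots the normal quantile of the cumulative fraction against size, where a straight line again indicates normality (frequencies illustrative, as before):

```r
# Normal-probability plot: qnorm of the cumulative fraction vs cluster size.
x <- seq(160, 320, by = 20)
f <- c(1, 2, 5, 9, 7, 4, 2, 1, 1)    # hypothetical frequencies
z <- cumsum(f / sum(f))              # cumulative weights, ending at 1

keep <- z < 1                        # qnorm(1) is +Inf, so drop the end point
plot(x[keep], qnorm(z[keep]),
     xlab = "Cluster size (nm)", ylab = "Normal quantile of CDF")
```

An approximately straight trend of these points is the same visual test the lecture makes with the probability-scaled y axis in ggplot.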
If the data exists only in published papers and you have access to the PDF files, it is possible to generate some form of data from those plots yourself. There are many tools available for this, and both GIMP and Engauge Digitizer, which I showed you, are free software, so you can download them and use them on your computer to get the data from a paper into digital format. You can then use LibreOffice, which is also free, to enter the data in CSV format. Once you have the data in CSV format, you can use R to do all the analysis, and we have shown one example of how to plot histograms and cumulative distribution functions with R from data that was given only as a histogram plot. That example also shows how to deal with data whose points carry different statistical weights: not all points in such data have the same weight. Some values have a frequency of 1 or 2, while at other values there are 50 or 60, and all values which lie within the range of a bin are collapsed into that single bin. So there is an uncertainty in the numbers: when we say 200 nanometers, it means 190 to 210 nanometers, and any aggregate in that size range is counted in that bin. We need to give different weights, and we have to understand that each bin carries an uncertainty in the actual value itself. But this is very common; you will hardly ever get the raw data of all the cluster sizes, for example. Once you have this kind of data, it is possible to proceed with the analysis in this way. That is the example we have shown, and you will have more exercises this week doing similar analysis on data of a similar type that we will give you. Thank you.