Welcome to this course on dealing with materials data. We are going to look at the collection, analysis and interpretation of materials data. We have already done one module on introduction to R, and we are now learning how to use R to do descriptive statistics. In this part we are looking at how to present experimental results. We have already seen how to do that, taking the conductivity of ETP copper as an example. There were 20 measurements, and we presented them in many different ways: rank based reports like histograms and dot charts, and summary based reports like mean, standard deviation, variance and quantiles.

In this session we are going to look at another very common kind of data in materials science and engineering, which has a slightly different character compared to the conductivity of ETP copper. When we looked at the copper conductivity data, there were about 20 measurements and each was slightly different due to errors and inaccuracies. The data was normally distributed. The mean was the value that we reported as the conductivity, and the standard deviation said how much spread there is about this mean: if you repeat the experiments, what values do you get and how far away are they from the mean. So these are the two quantities that completely described the data. We reported the conductivity as the mean, 101.3, plus or minus the standard deviation, which was either 0.1 or 0.1%. In this case the absolute error and the relative error happened to be the same numerically, and you can report it either way. However, sometimes a single measurement leads to a distribution of values. Grain size is an example.
So we will look at what a grain is and how we determine grain size. This shows an example of polycrystalline copper, and the colors represent different grains. So this is basically the grain structure in polycrystalline copper. This is the same electrolytic tough pitch copper on which the conductivity measurements were made by Dr. Harshwaradhana, and the figure is taken from his thesis. There is a color triangle here. It tells you that if you see a grain which is colored red, for example, then in that grain the direction perpendicular to the plane of this screen is the normal to the 001 family of planes. Similarly, if you see a blue colored grain, that grain is oriented in such a way that the direction perpendicular to the plane of this screen is the normal to the 111 plane. For the 101 family we use green. Near these corners the colors change slightly, so anything that is bluish means that the normal to the plane of the figure is close to 111. You can see that it is mostly blue, with some red, some green, yellow and so on. So this is the grain structure.

Now we want to say what the size of the grains in a material is. You can see that this is one single measurement, one single micrograph, and it gives you the grain shape, size and distribution. If you look at the different grains, they all have different sizes: there are big ones, there are some intermediate sized ones, and there are very small ones, for example this one. So a variety of grain sizes comes out of one single measurement, and this is very, very common.
Most materials are polycrystalline. Metallic materials and alloys are polycrystalline, and one typically measures the grain size. As you can see, it is not sufficient to give a number and a standard deviation as in the case of conductivity. In the case of conductivity, most of the values were lying slightly away from the mean, and that deviation was because of random errors or uncertainties in our measurement. That is not the case here. The grain size itself is distributed, and we need to give this information. So just giving the mean and standard deviation might not be sufficient, or might not represent the true nature of what you are measuring or observing.

There is this concept called ASTM grain size. It is basically a number indicating the grain size in a material, and it is defined as follows. You take the microstructure at 100x magnification, and in that micrograph you take 1 square inch and count the number of grains. The ASTM grain number is n if there are 2 to the power (n minus 1) grains in that square inch. So we take a micrograph, make sure that the magnification is 100x, take 1 square inch of it and count the number of grains, and based on that we give a number. This is called the ASTM grain number for that microstructure or for that material, and it is described in detail in Raghavan's book on materials science and engineering, for example.

So we have this grain structure and we have the grain size measured by the ASTM grain number, and I am going to show two data sets. Both data sets give grain sizes in steels, for two different steel samples. This data was generated by Mr. S. Purnachandra, who is a PhD student at IIT Bombay, and he has given me these data sets. The first set has grain IDs and grain sizes. I will show you the data set itself; we will open it in LibreOffice and see. The second set is slightly more involved because it is for a steel which consists of two phases. So in addition to the grain ID you also have a phase ID. It will say either that this is a grain of phase 1 or a grain of phase 2, and then it will give the size of that grain, because the microstructure consists of two phases. If you look at copper, for example, it is a single phase: everything is copper, and we simply get the grain sizes. But sometimes there is more than one phase, and this is true for most of the alloys that are used in engineering applications. Hardly any of them are single phase materials, so they will almost always have more than one phase, and the second set is given to deal with such scenarios. So in addition to grain IDs and grain sizes you also have the phase IDs.

These data files are very big, and as you will see it is no longer practical to enter these numbers by hand. Fortunately for us, these data files are generated by the computer, so you can save them in the CSV format, which is what Mr. Purnachandra has done before giving the data files to us for our study. We are going to load these CSV data files and do the analysis on them. The two data sets are called grain size data set 1.csv and grain size data set 2.csv. As you might have noticed, we want to give names which are, as much as possible, intuitive, easy to understand and clear to follow. So we are going to open these files in LibreOffice and inspect them. Let me go to the data directory and open grain size data set 1.csv. There is this column, integer identifying grain, which goes like 1, 2, 3, 4 etc.
Then there is the number of measurement points in the grain, and the area of the grain in square microns. The area is one measure of the grain size. You can also give the diameter of the grain in microns. Obviously, as you have seen, the grains are not circular or spherical, but you can take the circle that has the same area and ask what the diameter of such a circle is. So it is possible to give an equivalent diameter for grains. That is one way of defining an equivalent diameter, but it need not be the only way. So the area is an area measure and the diameter is a length measure. We also have the ASTM grain size, which is a number measure. So there are three different measures of grain size that we see here: one is an area measure, another is a length measure, and the third is a number. For our analysis we are going to use the ASTM grain size along with the integer identifying the grain. So this is one single data set and it already gives you a large number of grain sizes, and we are interested in looking at the distribution. The data itself is a distribution.

If you look at the second data set, it is very similar to the first one, except that now there is an extra column called phase identity. After the phase identity it gives the integer which identifies the grain, the number of measurement points in the grain, the area of the grain, the diameter of the grain and the ASTM number. By the way, the number of measurement points in the grain should also be proportional to the size of the grain, because if you are taking measurements at periodic distances, a larger area will have more measurement points. So at some level this is another measure of the size of the grain. But as you can see, these data files are too big.
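The relation between the three size measures can be sketched in R. This is only a sketch under stated assumptions, not necessarily how the data files were produced: the equivalent diameter is taken to be that of a circle with the same area, and the ASTM number is computed from the definition given earlier (2 to the power n minus 1 grains per square inch at 100x), assuming that one square inch at 100x covers 254 by 254 microns of the specimen. The function names are hypothetical.

```r
# Hypothetical helper names; the relations are the assumptions stated above.

# Diameter (microns) of the circle that has the same area as the grain.
equivalent_diameter <- function(area_um2) {
  2 * sqrt(area_um2 / pi)
}

# ASTM grain size number n, from N = 2^(n - 1) grains per square inch at 100x.
# One inch is 25400 microns, so at 100x one square inch of micrograph
# covers (25400 / 100)^2 = 64516 square microns of the specimen.
astm_number <- function(area_um2) {
  grains_per_sq_inch_100x <- 64516 / area_um2
  1 + log2(grains_per_sq_inch_100x)
}

# Two illustrative grain areas, built from diameters of 0.111 and 1.75 microns
area <- pi * (c(0.111, 1.75) / 2)^2
d <- equivalent_diameter(area)   # recovers the two diameters
n <- astm_number(area)           # the smaller grain gets the larger ASTM number
print(round(d, 3))
print(round(n, 1))
```

Note that the ASTM number goes up as the grains get smaller, which is worth keeping in mind when reading plots of ASTM grain size.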
For example, in grain size data set 2, if you go down you can see that there are of the order of 3600 data points. The first data set, grain size data set 1, is not that big, but it is still reasonably big. I think it has about 480 or 500 data points; we have about 486 data points. So obviously generating such a data file by putting data into R by hand is not practical, and it is also not meaningful, because manual entry can introduce its own errors. This data comes from the computer and it is stored as CSV so that we can import it into R and start working with it. So we are going to try all the descriptive analytical tools that we learnt while dealing with conductivity on this grain size data, for both the sets.

It is always a good idea to just plot the data to get an idea of what the data looks like. Of course you can open it in LibreOffice and inspect it, but that is still very cursory, and you can get an overall picture of the data just by plotting it. So we are going to do that also, and we are going to mostly use the ASTM grain size for our exercises. Any of the measures of grain size can be used, but we are going to stick to the ASTM grain size for this session.

For data set one, let us do the rank based reports. We have learnt about several rank based reports: scatter plot, stem and leaf plot, dot chart, cumulative distribution, histogram, and box and whisker plot. We are going to do all of these, and for data set one we are also going to do the property based reports. Those are mean, median, standard deviation, variance and quantiles. Finally we are going to plot the data and indicate these property based values on the plot to get a better understanding of the data. So that is what we are going to do in this session for data set one, and we will come back to data set two in the next one.
As usual, to do the data analysis we have to open R, and we look at the R version. It is 3.6.1, "Action of the Toes". We have to find out which directory we are in, and that is done by using getwd(). We are in the dealing with materials data directory. To load the data we need to know which are the data files, so let us look at the files in the data directory. If I list the data directory, there are all these files, and we are interested in grain size data set 1.csv and 2.csv. First we are going to deal with grain size data set 1.csv. Let us load the data into the variable x. Importing is done by using read.csv, because it is a CSV file, and we say that it is in the data directory and that it is grain size data set 1.csv. So let us read it.

As you can see, R immediately tells you that there are 485 observations of 5 variables, which we have already seen by opening the file in LibreOffice. Let us get some more information on x. It is a data frame. It has 485 observations and 5 variables, and those 5 variables are listed here: integer identifying grain, number of measurement points in the grain, area of grain in square microns, diameter of grain in microns and ASTM grain size. You can see that the integer identifying the grain is of type int and goes as 1, 2, 3, 4 etc. The number of measurement points is 3, 769, 130 etc. The area of grain has some 321 levels, 0.00624, 0.01935 etc. The diameter of the grain again is a number: 0.111, 1.75 etc. As you can see, the diameters are 0.11 and 1.75 where the numbers of measurement points are 3 and 769, so that is consistent with what one would expect: the larger grain has more measurement points. And the ASTM grain size is given as 23.7, 15.7 etc.

The easiest thing to do, without even opening the file in LibreOffice, is to say head(x), for example. It will give you the first few lines, 5 or 6 lines. You can also use the similar command tail to look at the last few lines.
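The loading and inspection steps described above can be sketched as follows. Since the actual data file is not reproduced here, the sketch first writes a small synthetic CSV to a temporary directory. The file name, column names and values are made up for illustration and are not the ones in the real data set.

```r
# Synthetic stand-in for the grain size file (all values are made up).
csv_file <- file.path(tempdir(), "grain_size_example.csv")
example <- data.frame(
  GrainID    = c(1, 2, 3, 4, 5, 6, 7, 8),
  NumPoints  = c(3, 769, 130, 12, 45, 9, 300, 27),
  AreaUm2    = c(0.01, 2.40, 0.41, 0.04, 0.14, 0.03, 0.94, 0.08),
  DiameterUm = c(0.11, 1.75, 0.72, 0.22, 0.42, 0.19, 1.09, 0.33),
  ASTM       = c(23.7, 15.7, 18.3, 21.7, 19.8, 22.1, 17.1, 20.5)
)
write.csv(example, csv_file, row.names = FALSE)

print(getwd())                                  # current working directory
print(list.files(tempdir(), pattern = "csv$"))  # which CSV files are available

x <- read.csv(csv_file)   # import the CSV file into a data frame
str(x)                    # number of observations, variables and their types
print(head(x))            # first six rows
print(tail(x, 3))         # last three rows
```

With the real file, the only change would be to pass its path inside the data directory to read.csv.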
So this is another way of taking a look at the data, but it is not the complete data. One of the easiest ways to look at the complete data is to plot x. Like I told you last time, when you just say plot x for a data frame, it makes a table of plots. It takes each of these 5 variables and plots it against all the other variables. So 5 by 5, there are 25 boxes that you can see. The 5 boxes on the diagonal carry the variable names, and each row has 4 plots, so 20 plots are there. Our interest is in the ASTM grain size. In fact, we are interested in looking at the integer identifying the grain and the ASTM grain size, so that is what we are going to plot and see. So this is how the figure looks. This is a slightly bigger picture, so you can clearly see what is there along the diagonal and what the data plots look like.

The first thing that we want to do is to make a scatter plot of the ASTM grain size against the integer identifying the grain. So let us plot that and see. When I plot it, you see that there are lots of data points quite close to 0 here, and lots of data points somewhere about 37,000 or 38,000. We know that there are 485 observations, and all of them are clustered in two places, and the rest in the middle is empty. So this picture is really not very helpful for me to understand how the data looks, and I want to understand why. Because everything is clustered around values of the first variable near 0 and near 38,000, let us look at what this integer identifying the grain looks like. If you do that, you can see that the numbers initially go as 1, 2, 3, 4, which intuitively makes sense, and somewhere about 190 there is suddenly a jump to 37,640. That is why after 190 you do not see anything before 37,600 and odd. So it is not meaningful to plot the data like this.
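The cause of the two clusters can be checked directly from the grain IDs. Here is a minimal sketch using synthetic IDs that mimic the pattern described above (consecutive values up to 190, then a jump to around 37,640). The column names are hypothetical, and the measurement counts and ASTM values are random numbers, not the real data.

```r
set.seed(1)  # reproducible synthetic data
ids <- c(1:190, sort(sample(37640:38340, 295)))   # 485 grain IDs with a jump
x <- data.frame(
  GrainID   = ids,
  NumPoints = sample(1:800, length(ids), replace = TRUE),
  ASTM      = round(runif(length(ids), min = 14, max = 24), 1)
)

pdf(NULL)                 # null device, so no file or window is needed
plot(x)                   # table of pairwise plots for the data frame
plot(x$GrainID, x$ASTM)   # two clusters with a wide empty stretch in between
invisible(dev.off())

# Locate the largest jump between consecutive grain IDs
jump_at <- which.max(diff(x$GrainID))
print(x$GrainID[c(jump_at, jump_at + 1)])   # last ID before and first after
```

Looking at diff() of the ID column is a quick way to find where such a break sits before deciding on the limits of a gap in the axis.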
So for the scatter plot, let us try to remove this gap in the figure. For doing that we have to use a library, and that library is known as plotrix. Let us take these two commands and put them in R. With the library command I am going to load the plotrix library, and the plotrix library allows you to make a gap plot. We are plotting the first column of x against the fifth column, but there is a gap, and I am telling it that the gap is between 230 and 37,600. That range, which has no data points in it, will be left out. The gap axis is x, because it is on the x axis that I want to leave out this range, and the rest we are going to plot.

If you do that, of course the data is now easier to visualize, but there is a difference. When we plotted earlier, the x axis was labeled 0, 10,000, 20,000, 30,000 etc. When you do the gap plot, the tick marks have disappeared, so we have to get them back. There is a way to get the tick marks, and there is also a way to get the other information. So let us go back and do this. Of course you can give the title. The axis labels also say x[littleones], y[littleones] and so on, so let us change them. Let us call this plot the grain size versus grain ID plot, the x label should be grain ID and the y label should be ASTM grain size. Let us do that. So now we have the grain size versus grain ID plot, ASTM grain size versus grain ID. That is what this command gives, and you can see it here.

Now let us also introduce the x tick marks. I am going to cut and paste this command. Let us look at this command again. It says gap plot of the first column, which is the grain ID, and the fifth column, which is the ASTM grain size, and it is a gap plot.
So we are saying that there is a gap in the x axis and the gap is from 230 to 37,600. The plot is called grain ID and ASTM grain size plot, the x label is grain ID and the y label is ASTM grain size. Then we are saying, introduce x tick marks, and the tick marks should go 0, 100, 200, that is this part, and then from 37,600 onwards up to 38,400, because we can see that the data goes up to 38,300 and odd. So 38,400 should about cover the entire range. If we do that, we have the complete data plotted, and the figure now looks very neat and professional.

You can save this figure. We have done it already, but we will do it once more because this is very common. We want to save it as a PDF in the figures directory, and we want to call the file grain size scatter plot.pdf. We open the PDF file, give the plotting commands, and then call dev.off() to close this PDF file and come back to showing figures on the screen. So we do that, and the plot is generated. We can go to the figures directory and see that there is a file, grain size scatter plot.pdf. You can look at its properties and see that it has just been generated. So this is the plot that is generated, grain ID versus ASTM grain size.

So what is the next step? Stem and leaf plot, dot chart and other measures. We will do that next.
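The gap plot and the PDF export described above can be sketched as follows. This assumes that the plotrix package is installed. The data are synthetic, and the output file name, plot title and exact tick positions are illustrative choices, not necessarily those used in the session.

```r
set.seed(1)  # synthetic stand-in for the grain ID and ASTM grain size columns
ids  <- c(1:190, sort(sample(37640:38340, 295)))
astm <- round(runif(length(ids), min = 14, max = 24), 1)

# Hypothetical output location; the session saves into a figures directory.
out_file <- file.path(tempdir(), "grain_size_scatter_plot.pdf")

if (requireNamespace("plotrix", quietly = TRUE)) {
  pdf(out_file)                             # open the PDF device
  plotrix::gap.plot(ids, astm,
                    gap = c(230, 37600),    # stretch of the axis to cut out
                    gap.axis = "x",         # the gap is along the x axis
                    xtics = c(0, 100, 200, 37600, 38000, 38400),
                    main = "Grain size versus grain ID",
                    xlab = "Grain ID",
                    ylab = "ASTM grain size")
  invisible(dev.off())                      # close the file, back to the screen
} else {
  message("plotrix is not installed; try install.packages(\"plotrix\")")
}
```

The gap limits are chosen so that no data point falls strictly between them, which is why 230 and 37,600 work for IDs that stop near 190 and resume near 37,640.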