Welcome to Dealing with Materials Data. In this course we are going to learn about the collection, analysis and interpretation of data, using materials data sets as examples, and specifically in these sessions we are going to learn about using R to do this. This is the second module, the module on descriptive statistics. In this module we are going to learn about presenting experimental results; specifically, we are going to prepare rank-based and property-based reports of experimental results. So we are going to take some data sets, and we will learn both how to prepare these reports and how to present them. That is what this session is going to be.

As the first example we are going to use the electrical conductivity of ETP copper. ETP copper is electrolytic tough pitch copper. It is commercially pure copper, and it is used in many practical applications where its conductivity is very important. Typically, in an industrial setting, the conductivity is measured using the eddy current method, and it is given in units of percentage IACS, where IACS stands for International Annealed Copper Standard. So what is reported is the conductivity measured in the sample relative to this standard.

We are going to consider data on ETP copper conductivity. These measurements were carried out by Dr. N. Harshavardhana and are reported in his PhD thesis submitted to IIT Bombay. He has measured the conductivity some 20 times, in different parts of the sample, and reported the values as a table. That is the data given here: measured using the eddy current method, reported in percentage IACS, and the 20 measurements give 101.4, 101.3, 101.3, 101.4 and so on. So this is the raw data, and this is the most complete reporting of data.
We have made 20 measurements, and the value of each measurement is given, typically in the same order in which the measurements were made. So the first measurement is 101.4, the second is 101.3, the third is 101.3 and so on. If the measurements are made on different parts of a sample, you can sometimes even give a schematic of the sample and mark where the first, second, third, etc. measurements were made. These are all made on the same piece of copper, and the idea of doing these measurements is to get the conductivity of the sample.

As usual, we want to deal with this data, so we store it as a CSV file. That is done: we have a file called ETP copper conductivity dot CSV, and that is the file we are going to use when we do the R programming.

Once the data is given, the most complete report is of course to just present the table, and this table is actually there in Dr. Harshavardhana's thesis. But it is not possible to keep reporting numbers like this for every measurement that you make. His thesis, for example, contains many conductivity measurements, and this is the first and only place where the complete set of measurements is given. The purpose is to give the reader an idea of how these measurements are made and what the numbers look like, so that when means, standard deviations and so on are reported later, people understand the type of data we are dealing with.

So it is always better to give the complete report, all the numbers, if possible. But many times it is not practical. You will see in Dr. Harshavardhana's thesis that, apart from this, there are not many places where the values of repeated measurements are given. And because it is not practical to keep giving numbers like this, we are expected to do data reduction and report the reduced data.
But then it is also important to say how the reduction itself has been carried out. So you have to give the reduced data along with the methodology that you used to reduce it. That is the most common thing done in scientific reports, theses, papers and so on.

However, having said that, the current standards are changing. Nowadays, when you do a data reduction analysis and report the reduced data with the methodology, you are also expected to share the raw data, that is, the complete data like what we saw in the table, along with the relevant codes, scripts, etc. that you have used. This is typically given as supplementary material, and it is a good practice. As you will see later, we are going to use some such data from the open literature to carry out our own analysis, to understand some of these methodologies and to learn about dealing with materials data. It also allows others to carry out the same analysis or, if they have a different methodology, to apply it to this data.

So it is very important to have the raw data available. The most preferred current standard is not just to do the data reduction and report the methodology, but also to store the raw data somewhere, along with the scripts and codes used for the reduction, and make it available to everybody, so that people can independently do the same analysis or, if they want, a different analysis. We will also show you some papers in the recent literature which do a commendable job of reporting raw data, and those should be the standards we aspire to.

So the first reduced data is the stem and leaf plot. In fact, the stem and leaf plot has the same amount of information as the table, except for one modification: it loses the information on the order in which the data was obtained.
The stem and leaf plot, as the name indicates, gives the data in a tabular form. It consists of a stem, which is the left-hand column, and leaves attached to each of the stems; we will show what this looks like for the conductivity data. Like I said, even though it is a complete report of the results, it still misses the sequential information, if any, in the data. We are going to order the data and then make the stem and leaf plot, which means we lose the information about the original order in which the data was obtained.

The dot chart is another way of presenting results. Again it gives the complete information, but dot charts are also nice visual summaries of experimental data. They can be used to identify outliers and, if you have more than one data set, they will also reveal the differences between the data sets. We are going to see examples of both as we go along in the course.

The third way of presenting data is to give cumulative distribution plots. Here again we order the data in increasing values, obtain the cumulative sum and plot it. There are several ways of making cumulative distribution plots: you can use your own script, or you can use inbuilt calls like ecdf or plot.ecdf to get the cumulative distribution. You can also do slightly more involved plotting in ggplot, which lets us change the scale of the y axis in these cumulative distribution plots to a probability scale, to see whether the data follows the normal distribution or not. This might be important in some cases. There is more involved analysis that one can do, and changing the y axis to a probability scale is just the first step, but we will see an example of how to do this.

The next set of plots that we present are histograms and box and whisker plots.
So you can bin the data and present histograms: how many measurements fall in this range, how many in the next range, and so on. Histogram plots are very important, especially if the distribution is not normal, or is not what people expect, or if you want to give explicit information about the distribution. Box and whisker plots carry similar information; they indicate the distribution of the data. You can also use commands like quantile to get the spread of the data.

And finally there are the property-based reports that one can make: mean, median, standard deviation, variance and so on. In these tutorials we will also see, in addition to getting the property-based reports, how to combine them with the rank-based reports graphically. You put the property-based information on the same plots, and that helps us understand the data better.

So this is a session on reporting rank-based and property-based data, and we are going to use the electrical conductivity of ETP copper as the example case for all this analysis. Let us do that.

The first thing to do is to open R, read the data and make a stem and leaf plot. It is very easy. We read the data on ETP conductivity into X, and then we save the conductivity data in small x. Now if you say stem of x, you get the stem and leaf plot. As you can see, it says the decimal point is one digit to the left of the pipe. The pipe is here, so the decimal point is here; we know the data is all 101.3, 101.4, etc. So the stems are 101.1, 101.2, 101.3, etc.
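The reading and stem and leaf steps just described can be sketched in R as follows. This is only a sketch under stated assumptions: the file name and column name in the commented-out lines are hypothetical, since the lecture does not spell them out, and the 20 values are rebuilt from the counts read off the stem and leaf plot in this session, so they are in sorted order and not in the original measurement order.

```r
# Hypothetical file and column names; adjust them to your own CSV file.
# X <- read.csv("etp_copper_conductivity.csv")
# x <- X$conductivity

# Stand-in data (assumption): values rebuilt from the stem and leaf counts,
# 1 x 101.1, 3 x 101.2, 9 x 101.3, 5 x 101.4, 2 x 101.5, in sorted order.
X <- data.frame(conductivity = rep(c(101.1, 101.2, 101.3, 101.4, 101.5),
                                   times = c(1, 3, 9, 5, 2)))
x <- X$conductivity   # small x holds just the conductivity values

length(x)   # number of measurements
stem(x)     # prints the stem and leaf plot to the console
```

Keeping the whole table in X and the single column in x mirrors what is done in the session; with your own CSV only the two commented lines are needed.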
And you can see that there is one data point at 101.1, there are three at 101.2, nine at 101.3, five at 101.4 and two at 101.5. If you add them all up there is a total of 20 data points: 1 plus 3 is 4, plus 9 is 13, plus 5 is 18, plus 2 is 20. So this is the stem and leaf plot; this column is called the stem and these are called the leaves. You can also see how the data is distributed, and it looks like a normal distribution, at least going by these numbers.

Like I said, the stem and leaf plot is complete in the sense that all 20 data points that we saw are here, except that now they are ordered. The information that, say, the 17th measurement gave 101.4, the 18th and 19th gave 101.3 and the 20th gave 101.1 is missing here. The sequential information may be important: for example, if the values go 0.5, 0.4, 0.3, 0.1, is there some reason why the measurements decrease like this? Any such information is missing from the stem and leaf plot; otherwise it has all the data. And by looking at it you can see not only how the data is distributed but also which is the most repeated number, the mode of the data.

Now let us do the dot chart: dotchart of x. Here is the dot chart, and again the values are 101.1, 101.2, 101.3, etc. You can see two data points here, five here, nine here, three here and one here. The dot chart is plotted in such a way that it tells you about the values, and 101.1 stands out like this because it is sort of an outlier. Dot charts are useful for identifying such outliers; we will see later why this one is an outlier. Again, the dot chart gives all the information and loses only the sequence of the measurements.
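As a rough sketch of the counting and the dot chart, assuming the same stand-in vector rebuilt from the counts above (sorted, not in measurement order), something like this could be used. The names counts and mode_value are introduced here only for illustration.

```r
# Stand-in data (assumption): values rebuilt from the stem and leaf counts.
x <- rep(c(101.1, 101.2, 101.3, 101.4, 101.5), times = c(1, 3, 9, 5, 2))

# Count how often each value occurs; this is the same tally the
# stem and leaf plot shows as leaves.
counts <- table(x)
print(counts)

# The most repeated value, that is, the mode of the data.
mode_value <- as.numeric(names(counts)[which.max(counts)])
print(mode_value)

# Dot chart of the values; an isolated point far from the rest hints
# at a possible outlier.
dotchart(x, xlab = "Conductivity (% IACS)")
```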
But in addition it also shows you where the outliers are, and in this case 101.1 happens to be the outlier.

So we have done the dot chart. Next let us do the cumulative plot, and this is how it is done. We sort the conductivity with decreasing set to false, so it is in increasing order, lowest to highest. We take the cumulative sum of the sorted values as y, and we normalize it: length of y gives the number of elements in the vector, and because y is a cumulative sum its last element is the total, so we divide by that last element so that the values go from 0 to 1. Then we plot the sorted conductivity values against the normalized cumulative sum. The type is a step-like plot; that is what type equal to s means. So let us do this and look. As you can see, we get these steps because we said type equal to s; the conductivity values are on this axis and the normalized cumulative sum is on this one. So this is the CDF.

There are other ways of plotting this. For example, you can say plot.ecdf, which stands for empirical cumulative distribution function, and give it the conductivity column of X. You can see it is the same plot as we got, except that the plotting style is slightly different, and it shows you the empirical cumulative distribution. You can look at the help for ecdf to confirm: it is the empirical cumulative distribution function. There is another way, of course: you can say plot, and for what should be plotted you give ecdf of the conductivity. Again it is the same plot, so you can say either plot.ecdf of the data or plot of ecdf of the data.
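The three routes to the cumulative plot can be sketched as below, again on the stand-in vector rebuilt from the stem and leaf counts (an assumption, not the original file). The variable names xs and y are chosen here for illustration.

```r
# Stand-in data (assumption): values rebuilt from the stem and leaf counts.
x <- rep(c(101.1, 101.2, 101.3, 101.4, 101.5), times = c(1, 3, 9, 5, 2))

# Route 1: by hand, as described in the session.
xs <- sort(x, decreasing = FALSE)   # lowest to highest
y <- cumsum(xs)                     # cumulative sum of the sorted values
y <- y / y[length(y)]               # divide by the last element (the total)
plot(xs, y, type = "s",             # type = "s" draws a step plot
     xlab = "Conductivity (% IACS)", ylab = "Normalized cumulative sum")

# Route 2: the inbuilt plotting call on the raw values.
plot.ecdf(x)

# Route 3: build the ECDF object first, then plot it.
plot(ecdf(x))
```

One point worth noting about the hand-made version: it normalizes the cumulative sum of the values themselves. Because all the values here are close to 101, this is very nearly the fraction i/n of points counted so far, which is why it looks like the ecdf plots. For data with a wide spread of values the two would differ, and ecdf would be the safer choice.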
So there are two or three different ways of doing it, but they all give you the same result, namely the cumulative distribution plot of the data.

Now we want to use ggplot, and we want to change the y range to a probability scale. For that we need the ggplot2 and scales libraries, so we invoke library ggplot2 and library scales. As usual with ggplot, you have to tell it the data and the aesthetics. Conductivity is all that is there in the data, so that is what we plot; the statistic we want is the cumulative distribution function; and we transform the y scale to a probability scale. The probability scale lets us compare what the data would look like if it followed a normal distribution with what the actual data looks like. That is why we do this scale transformation. So let us do this, and you have the plot. If it is close to a straight line, that indicates that the data has a normal distribution.

There is a warning message: the transformation introduced infinite values in the continuous y axis. It is important to read warnings and pay attention to them, but in this case it is not important. As you can see, the scale is now slightly different: it tells you where 25 percent of the data is, where 50 percent is, where 75 percent is and where 100 percent is. So it gives you not just the data but the distribution of the data.

The next thing we want to do is a histogram plot. We take the conductivity from X; the x label is conductivity and the title is copper conductivity, and you can see the histogram.
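A possible version of the ggplot step is sketched below. It is a sketch under assumptions: the data frame is the stand-in rebuilt from the stem and leaf counts, the column name conductivity is assumed, and the exact call used in the session is not shown in the transcript. The plotting part runs only if ggplot2 and scales are installed.

```r
# Stand-in data (assumption): values rebuilt from the stem and leaf counts.
X <- data.frame(conductivity = rep(c(101.1, 101.2, 101.3, 101.4, 101.5),
                                   times = c(1, 3, 9, 5, 2)))

# The normal quantile function sends probabilities 0 and 1 to -Inf and +Inf.
# This is one likely reason for the "infinite values" warning, since the
# ECDF touches 0 and 1 at its two ends.
end_points <- qnorm(c(0, 1))
print(end_points)

if (requireNamespace("ggplot2", quietly = TRUE) &&
    requireNamespace("scales", quietly = TRUE)) {
  library(ggplot2)
  library(scales)
  p <- ggplot(X, aes(x = conductivity)) +        # data and aesthetics
    stat_ecdf() +                                # statistic: empirical CDF
    scale_y_continuous(trans = probability_trans("norm"))  # probability scale
  print(p)
}
```

Newer ggplot2 releases prefer the argument name transform over trans and may print a deprecation note for the call above.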
Histogram plots again are very nice, because you can see the distribution of the data: there are one, three, nine, five and two data points in the bins, which we already know. So this is the histogram plot.

The next one we wanted to do is a box and whisker plot. Again the command is very easy and, as we have been doing, you can call help on boxplot, for example, to learn more about these commands. Here is the box plot, and it indicates where the data is, where the median lies and so on. The original box plot is vertical, and 101.1 shows up as an outlier here again. We can flip it to be horizontal, so that the conductivity values are along this axis. The box holds the second and third quarters of the data, that is, the data between the first and third quartiles, and this line is the median. You can see that there is one data point lying well outside, so 101.1 is an outlier. We got a hint of this earlier from the dot chart, and we will see later why that is so.

You can also say quantile. This is not a graphical representation, but it gives you the numbers. It says that 25% of the data is reached at 101.3, 50% also at 101.3, and 75% at 101.4, with the maximum at 101.5. So the quantiles tell you how much of the data is in which range. By default it gives you only 0, 25, 50, 75 and 100%. If you want more information than that, you can give the probabilities explicitly: here we again call quantile, but instead of the default steps of 25% we go in steps of 10%, so 0, 10, 20, etc. This again gives you some information about the spread of the data, and these are all the rank-based ways of representing the data.

Now let us calculate the summary numbers for the property-based reports. So we compute the mean, let us call it mu, and the median.
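The histogram, box plot and quantile calls described here might look like the following sketch. The data is the stand-in vector rebuilt from the stem and leaf counts (an assumption), and the names out, q_default and q_deciles are introduced only for illustration. The bins that hist chooses by default may differ from the ones seen in the session.

```r
# Stand-in data (assumption): values rebuilt from the stem and leaf counts.
x <- rep(c(101.1, 101.2, 101.3, 101.4, 101.5), times = c(1, 3, 9, 5, 2))

# Histogram with an x label and a title.
hist(x, xlab = "Conductivity (% IACS)", main = "Copper conductivity")

# Box and whisker plot, flipped to horizontal.
boxplot(x, horizontal = TRUE, xlab = "Conductivity (% IACS)")

# Points that the box plot rule draws beyond the whiskers.
out <- boxplot.stats(x)$out
print(out)

# Default quantiles: 0, 25, 50, 75 and 100 percent.
q_default <- quantile(x)
print(q_default)

# Quantiles in steps of 10 percent, given explicitly through probs.
q_deciles <- quantile(x, probs = seq(0, 1, by = 0.1))
print(q_deciles)
```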
You get the mean to be 101.32 and the median to be 101.3. Now you can get sigma, which is the standard deviation, and the variance, which is var. Let us do that. sd is for standard deviation, and we store it as sigma. So mu is the mean, sigma is the standard deviation, mu minus 2 sigma to mu plus 2 sigma is the 2 sigma range about mu, and var gives the variance. The variance is about 0.01 and sigma is about 0.1.

Now let us do the final thing: let us plot the data and put the property-based reports together with it. We plot the conductivity and draw lines at the mean, at the median, and at mu plus sigma, mu minus sigma, mu plus 2 sigma and mu minus 2 sigma. If you do that, here is the data; the black line is the mean, the median, 101.3, is the red line, and these lines are 1 sigma from the mean. So these data points lie within 1 sigma. If it is a normal distribution, and we have been thinking that it is, you would expect a large percentage of the data, about 95%, to lie within 2 sigma of the mean. You can see that it does, but there is one data point that lies just outside 2 sigma. This is the mu minus 2 sigma line, and that is why this point is an outlier. It was indicated in the dot chart and in the other plots too: not so much in the histogram, but in the box plot we did see that it is an outlier. Here we also see why it is an outlier.

So this is another way of looking at the data: on one plot we have put both the data and the property-based reports. Things like histograms and cumulative distributions are called rank-based representations. So we have taken a simple data set on conductivity, prepared both rank-based and property-based reports, and learned how to present them using R.
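The property-based numbers and the combined plot can be sketched as follows. The data is the stand-in vector rebuilt from the stem and leaf counts (an assumption), so the points appear in sorted order and not in measurement order. The black and red lines follow the session; the dashed and dotted styles for the sigma lines, and the name outside, are choices made here for illustration.

```r
# Stand-in data (assumption): values rebuilt from the stem and leaf counts.
x <- rep(c(101.1, 101.2, 101.3, 101.4, 101.5), times = c(1, 3, 9, 5, 2))

mu    <- mean(x)     # mean
med   <- median(x)   # median
sigma <- sd(x)       # sample standard deviation
v     <- var(x)      # sample variance, equal to sigma^2
print(c(mean = mu, median = med, sd = sigma, var = v))

# Plot the data and overlay the property-based values as horizontal lines.
plot(x, ylab = "Conductivity (% IACS)",
     ylim = range(c(x, mu - 2 * sigma, mu + 2 * sigma)))
abline(h = mu, col = "black")                 # mean
abline(h = med, col = "red")                  # median
abline(h = mu + c(-1, 1) * sigma, lty = 2)    # mu -/+ 1 sigma (dashed)
abline(h = mu + c(-2, 2) * sigma, lty = 3)    # mu -/+ 2 sigma (dotted)

# Points lying outside the 2 sigma range about the mean.
outside <- x[abs(x - mu) > 2 * sigma]
print(outside)
```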
We will continue with some more data, which can be a little more complicated than this, in the sessions to come. Thank you.