Welcome to Dealing with Materials Data. We are looking at the collection, analysis and interpretation of data from materials science and engineering. So far we have the introduction to R module, and this module, which is meant for doing descriptive data analysis using R. We have already analysed two sets of data: one on the conductivity of electrolytic tough pitch copper, and one on grain size. We found that the conductivity measurements and the grain size data are of two different types. In the case of conductivity, the repeated measurements just scatter about some mean value, because of measurement errors, random errors or uncertainties associated with the experiment. When we measure something like grain size, on the other hand, it naturally has a distribution: not all grains are of the same size. They follow a distribution, and that distribution is not normal or Gaussian or anything like that; it is slightly more complicated. So in this case it makes sense to represent the data not just with the mean and standard deviation, as we did for conductivity, but also by plotting something like a histogram to indicate what the data looks like.
So let us now continue with the second data set. It is also a data set that deals with distributions: this is the grain size data set 2.csv. It is a slightly more complicated data set because it contains grain sizes of two phases. We are going to carry out all the rank based and property based analyses for both phases, and in all cases we are going to do them one next to the other. So we will have information about the grain sizes of these two phases, and we will also have a comparison between phase 1 and phase 2 in terms of their rank based and property based summaries. That is what we are going to do.
One of the things you have to carry from this session is the following. Suppose you report grain size the way we reported conductivity, just by giving the mean and standard deviation. You will see that the grain size of phase 1 is 24.1 plus or minus 0.4, and the grain size of phase 2 is 23.4 plus or minus 2. Sometimes students make the mistake of thinking that these two grain sizes are different, and that the grain size of phase 2 is smaller than that of phase 1. That is not true, because you also have to take the uncertainty into account. When we say 23.4 plus or minus 2, it means that the number could be anywhere between 21.4 and 25.4, and 21.4 to 25.4 covers 24.1 also. When we say 24.1 plus or minus 0.4, that means 23.7 to 24.5. So within the error bars, all that you can say is that these two phases have the same grain size.
However, if you look at the histogram, as we will do, or at the quantile information, you will see that the mean and standard deviation are not the complete picture. These two grain size distributions are very different, even though they end up giving you more or less the same grain size. That is the part that has to come through in this session, and that is what we are going to see and understand.
So, as usual, we open R, check the version and make sure that we are in the right directory. Now we are all set to start importing the data, and in this case the data is in CSV format. So we use read.csv to read the file from the data directory; grain size 2 is data set 2. As you can see, there are 3664 observations and 6 variables. There are 6 variables because, in addition to the 5 variables that you saw in the other case, here the phase identity is also included, before the grain identity. Of course, you can get more information by looking at the structure of this object X.
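The error bar argument above can be written out as a small calculation. This is only a sketch: the four numbers are the ones quoted in the session, while the variable names and the overlap test are illustrative choices, not commands used in the lecture.

```r
# Reported summaries from the session: mean and spread for each phase
m1 <- 24.1; s1 <- 0.4   # phase 1
m2 <- 23.4; s2 <- 2.0   # phase 2

# Intervals implied by "mean plus or minus spread"
int1 <- c(m1 - s1, m1 + s1)   # phase 1: 23.7 to 24.5
int2 <- c(m2 - s2, m2 + s2)   # phase 2: 21.4 to 25.4

# Two intervals overlap when each one starts before the other one ends
overlap <- int1[1] <= int2[2] && int2[1] <= int1[2]

print(int1)
print(int2)
print(overlap)
```

Because the two intervals overlap, the summaries alone give no basis for saying that one phase has smaller grains than the other.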
It is a data frame with 3664 observations and 6 variables. The variables are phase identity, integer identifying the grain, number of measurement points, area of the grain, diameter of the grain and ASTM grain size. Like we did earlier, we are going to be concerned only with the integer identifying the grain and the ASTM grain size, but now for the 2 phases. This is a 2 phase microstructure, so there are 2 phase identities, 1 and 2, and we are going to be working with these 2.
Another thing we can always try is plot X. It is 6 by 6, so you should have 36 boxes: every parameter is plotted against phase identity, against the integer identifying the grain, and so on, and you can see how the data looks. This is the first step, and it does not distinguish between the phases.
So let us do the next plot. We want to plot the grain identity versus the grain size, and we want to color the points according to the phase identity, so that phase identity 1 gets one color and phase identity 2 gets another color. This is how the plot looks: these are the sizes and these are the grain identities. You can also switch the axes, which is how it was originally, and you will get this. I am showing this plot because it is closer to what you would see in a dot chart. But here it is very difficult to distinguish between the black points and the red points, whereas in a dot chart they will be separated. So a dot chart is always a nicer way of visualizing the data than a plain scatter plot. Of course, you can play with the plain scatter plot yourself and come up with commands that separate the data, but there is an easier way using the existing libraries. And in this plot it is much clearer.
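The inspection steps described above can be tried on a small stand-in data frame. This is a sketch under stated assumptions: the data are synthetic, the column names (`phase`, `grain_id`, `n_points`, `area`, `diameter`, `astm`) are hypothetical, and the round trip through a temporary file only mimics the CSV import. Only the column order follows the session.

```r
set.seed(1)  # synthetic stand-in data, not the course data set
n1 <- 60; n2 <- 140
x <- data.frame(
  phase    = rep(c(1L, 2L), times = c(n1, n2)),
  grain_id = seq_len(n1 + n2),
  n_points = sample(50:500, n1 + n2, replace = TRUE),
  area     = runif(n1 + n2, 1, 100),
  diameter = runif(n1 + n2, 1, 12),
  astm     = c(runif(n1, 20, 24.5), runif(n2, 12, 24.5))
)

# Round trip through a temporary CSV file to mimic the import step
f <- tempfile(fileext = ".csv")
write.csv(x, f, row.names = FALSE)
x <- read.csv(f)

str(x)    # a data frame with 200 observations of 6 variables
plot(x)   # scatter plot matrix: 6 x 6 = 36 panels

# ASTM grain size against grain identity, coloured by phase identity;
# with the default palette, colour 1 is black and colour 2 is red
plot(x[, 6], x[, 2], col = x[, 1],
     xlab = "ASTM grain size", ylab = "Grain ID")
```

Passing the phase column as `col` works here because the phase identities are the integers 1 and 2, which R maps to the first two palette colours.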
So there are lots of red points and fewer black points. The black points are all clustered here, and the red points are spread all over. The black points lie between 20 and a little over 24, whereas the red points lie between 12 and a little over 24. This is an important point, so we will come back to it.
Now let us separate out the data. This command tells R that we are going to make two plots, one on top of the other, and we are going to make those two plots for phase 1 and phase 2. The first grep command gets the row numbers of all the rows that have data for phase 1, and the second grep command gets the row numbers of all the rows that have data for phase 2; that is what i1 and i2 hold. So if I plot x[i1, 6] versus x[i1, 2], this will be only the data corresponding to phase 1, because we have separated out those row numbers, and i2 does the same for the rows that have data for phase 2. Of course, we are going to label the plots as phase 1 and phase 2; the x label is ASTM grain size and the y label is grain ID.
So let us make this plot. You can see that phase 1 has grain sizes ranging between 20 and a little over 24, and phase 2 has grain sizes between 12 and 24. So this range is about 3 and this one is about 12; the spread is 4 times as much. As I said, we are going to see later that both of them give you a grain size somewhere about 24. Both show the same grain size, and of course phase 2 shows more spread: a spread of 2 as compared to a spread of 0.4, so 5 times. However, looking at this data, it is now very clear that the phase 1 and phase 2 grain sizes are completely different in terms of their distributions, even though overall gross properties like the mean might be the same. That is the point behind making this plot. Like I said, you do not have to make these 2 plots separately; the dot chart will do it automatically for us.
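A minimal sketch of the separation step, assuming synthetic data and hypothetical column names (the session indexes columns by position, 1 for phase, 2 for grain ID and 6 for ASTM grain size). The anchored patterns `"^1$"` and `"^2$"` and the use of `par(mfrow = ...)` for the stacked layout are assumptions about how the commands could be written, not a transcription of them.

```r
set.seed(2)  # synthetic stand-in data
x <- data.frame(
  phase    = rep(c(1L, 2L), times = c(50, 150)),
  grain_id = 1:200,
  astm     = c(runif(50, 20, 24.5), runif(150, 12, 24.5))
)

# Row numbers belonging to each phase; grep() matches against the
# character form of the column, so the patterns are anchored
i1 <- grep("^1$", x$phase)
i2 <- grep("^2$", x$phase)

# Two plots, one on top of the other, sharing the same x range
old <- par(mfrow = c(2, 1))
xr <- range(x$astm)
plot(x$astm[i1], x$grain_id[i1], xlim = xr, main = "Phase 1",
     xlab = "ASTM grain size", ylab = "Grain ID")
plot(x$astm[i2], x$grain_id[i2], xlim = xr, main = "Phase 2",
     xlab = "ASTM grain size", ylab = "Grain ID")
par(old)  # restore the single-plot layout
```

For numeric phase labels, `which(x$phase == 1)` returns the same row numbers and avoids pattern matching altogether. Fixing `xlim` to the common range makes the difference in spread visible at a glance.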
So we are going to look at that, but before we do the dot chart let us do the stem plots, and again we are going to do 2 stem plots. The first stem plot is for phase identity 1. You can see that the decimal point is 1 digit to the left of the pipe symbol, so the values are 20.8, 21.0, 21.2 and so on, then 21.6, and 22.0, 22.0, 22.0; these are the data points. You can see that the data has a tail and then slowly rises to a peak. The peak is here: there are 275 data points with the value 24.2. That is why the average falls somewhere about 24.2, because the counts at the other values are small compared to this one. So the data has a long tail on one side, but there is nothing on the other side, no tail at all. This is the peak, and the tail extends from the peak on one side only.
We can now do the stem plot for phase 2. Of course, it is much more skewed. You see the data points at 11, 12, 13 and so on, and then you see that there are 52 more data points, 616 more data points, 1924 more data points. So here also the peak is at the end, and from there the data just trails off in one direction. Stem and leaf plots are a nice way to learn about the structure of the data, which will become more apparent when we do the histogram plot later.
Before that, let us do the dot chart. Notice that because we previously asked for 2 rows of plots, that setting continues; if you want to change it, you have to give the command again to go back to a single plot. In this case, though, we do want to see the 2 plots, for phase 1 and phase 2. So let us continue; this is the second dot chart. You can see that this is the same as the 2 plots we made previously, so the dot chart again gives you this information. Now let us make a single dot chart. How do we do it? We do a dot chart of X and color by the factor of the first column of X.
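The stem plot and the single coloured dot chart can be sketched as follows. The data are synthetic (a peak near 24.2 with a tail towards smaller values, loosely imitating the shapes described), and the use of the `groups` argument of `dotchart` to separate the phases is an assumption; the session only says that the chart is coloured by the factor of the phase column.

```r
set.seed(3)  # synthetic stand-in data with a one-sided tail
phase <- rep(c(1L, 2L), times = c(40, 120))
astm  <- c(24.2 - rexp(40, rate = 1.5),    # phase 1: short tail
           24.2 - rexp(120, rate = 0.4))   # phase 2: long tail

# Stem-and-leaf display for phase 1, rounded to one decimal place
stem(round(astm[phase == 1], 1))

# Single dot chart: one block per phase, coloured by phase identity
dotchart(astm, groups = factor(phase), color = phase,
         xlab = "ASTM grain size")
```

With `groups` set, `dotchart` draws each phase in its own block, so the black and red points no longer overlap as they do in a plain scatter plot.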
So we are going to say dot chart of the 6th column, which is the sizes, and color by the factor of column 1, which means phase 1 and phase 2 are plotted with different colors. Here is the plot. Now you can see that the dot chart automatically separates the black points from the red points. Previously our scatter plot overlapped them, but the dot chart separates them out and pushes them apart. So you do not have to make 2 different plots; just by calling dot chart with the factor you can see how the data looks, which is a nice way of looking at data.
Let us now move to the rank based properties. We will do the cumulative distribution function, of course; you can say plot.ecdf of X[, 6]. This is the empirical cumulative distribution function, and you can see that it rises like that because we know there is a peak somewhere here and a tail on one side. This is true for phase 2 also, but it has a much longer and fatter tail compared to phase 1. If you put the plots in 2 rows, you will see this immediately, because both of them go from 0 to 1. Both are skewed and both have long tails, and the tail is on one side only; that is what I mean by skew. But the tail is much longer here, and because the plots are on the same scale you can also see that this tail is much fatter than that one. So you get this information about the skewness and about how much of the data is in the tail. Both distributions peak somewhere around the same value, but their tails have different characteristics.
So let us go back to a single plot. We have looked at the empirical cumulative distribution function. Can we do that using ggplot? Let us do that also.
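As a rough illustration of the comparison above, the two empirical cumulative distribution functions can be plotted in 2 rows on a common x range. The data here are synthetic and the variable names are hypothetical; `ecdf` and its plot method are standard R.

```r
set.seed(4)  # synthetic stand-in data with one-sided tails
astm1 <- 24.2 - rexp(300, rate = 1.5)   # phase 1: short tail
astm2 <- 24.2 - rexp(900, rate = 0.4)   # phase 2: long, fat tail

F1 <- ecdf(astm1)   # step functions rising from 0 to 1
F2 <- ecdf(astm2)

# Two rows on the same x range so the tails can be compared directly
old <- par(mfrow = c(2, 1))
xr <- range(astm1, astm2)
plot(F1, xlim = xr, main = "Phase 1", xlab = "ASTM grain size")
plot(F2, xlim = xr, main = "Phase 2", xlab = "ASTM grain size")
par(old)

# Fraction of grains with size at or below 22 in each phase
cat("Phase 1:", F1(22), " Phase 2:", F2(22), "\n")
```

An `ecdf` object can be called like a function, so the weight in the tail can be read off as a number rather than judged by eye.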
So we use the ggplot2 and scales libraries. The reason we are doing this is that we want to see whether the data shows any normal behavior. Obviously it is not going to, because it peaks at one end and is highly skewed, whereas a normal distribution should have nice symmetric tails on either side. So we are not expecting normal behavior, but we want to see what kind of plot you get on the probability scale when the data has this kind of skew. Let us take a look at it.
I have plotted both data sets on the same plot. As usual with ggplot, you have to say that the data is x1, the aesthetic is the ASTM grain size, and the statistic is the ECDF. Then you plot the second data set: again the aesthetic is the ASTM grain size, and you plot the empirical cumulative distribution function. So we have done that, and both curves are on the same plot. What do we do next? We are going to change the scale. These commands are not needed, because we already have this information, so this one is also not needed.
So we are going to make two plots, and in them we change the scale to the normal probability scale. Obviously we do not expect the data to be normal, so we do not expect these curves to turn out to be straight lines; we are just confirming that. It is not clear here, because these two figures are one on top of the other, but you will see it here. In both cases the curve is not a straight line: in this one there is some amount of deviation, and in this one it is very clearly not a straight line at all. You can also see that the scale goes only up to 0.4 here and only up to 0.25 here; it does not go up to 1 as you would have seen in other cases. That is because the data is not symmetric about the mean; the tail is on one side only, and that is what is seen in these plots also.
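The idea of checking normality on a probability scale can also be sketched in base R. This is a substitute for the ggplot2 and scales approach used in the session, not a reproduction of it: the data are synthetic, the helper `prob_points` is hypothetical, and plotting positions of the form (i - 0.5)/n are one common convention.

```r
set.seed(5)  # synthetic stand-in data with one-sided tails
astm1 <- 24.2 - rexp(300, rate = 1.5)   # phase 1
astm2 <- 24.2 - rexp(900, rate = 0.4)   # phase 2

# Sorted data paired with normal quantiles of its cumulative fractions
prob_points <- function(v) {
  v <- sort(v)
  p <- (seq_along(v) - 0.5) / length(v)   # strictly inside (0, 1)
  data.frame(x = v, z = qnorm(p))
}
p1 <- prob_points(astm1)
p2 <- prob_points(astm2)

# Normally distributed data would fall on a straight line here
plot(p2$x, p2$z, col = "red", pch = 20, yaxt = "n",
     xlim = range(p1$x, p2$x), ylim = range(p1$z, p2$z),
     xlab = "ASTM grain size",
     ylab = "Cumulative probability (normal scale)")
points(p1$x, p1$z, pch = 20)

# Label the transformed axis with the probabilities themselves
ticks <- c(0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99)
axis(2, at = qnorm(ticks), labels = ticks)
```

Both sets of points bend away from a straight line, which is the signature of skewed, non-normal data on this scale.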
So we will come back and do more of the other analyses, histograms, box plots and so on, for the same data set. Thank you.