Let us continue exploring. We just made a scatter plot, and now we are going to make a stem plot. A stem-and-leaf plot is very easy: you just have to say which variable to plot. In this case we are plotting the grain size, so we just say stem and you get this. You can see that the decimal point is at the pipe symbol itself, so the values read 9.3, 9.4, 9.9 and so on, then 10.4, 10.4, 10.5, 10.6, 10.6, etc. This also gives you a good idea of the spread of the data, and you can already see that there is a peak here and there seems to be another peak towards the end. So it looks like the data has two peaks; that is what the stem plot indicates.
Of course you can also make a dot chart, so we are going to use dot chart on the same quantity, the ASTM grain size. This is the dot chart: the grain size runs from about 10 to probably 27 or 28. Previously I showed you that a dot chart can show the extreme points or outliers, but from this figure it is very difficult to figure out which one is an outlier. So the dot chart and the stem-and-leaf plot are two other ways of visualizing data.
Once we have done that, let us prepare some rank-based reports and represent them graphically. We have seen the empirical cumulative distribution function, and we know that the easiest way to generate one is to say plot ECDF and give it column 5. There is another way, which we have also learned: you can just say plot of the ECDF of this data, and that is no different. We also saw yesterday how to make our own cumulative distribution plot. It is a good idea to know what this empirical cumulative distribution function is, so just to remind ourselves how it is done, let us generate it ourselves once more. This is the one.
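The commands typed on screen are not captured in the transcript. The following is a minimal sketch of what the stem plot, dot chart, ECDF plot, and hand-written cumulative plot might look like in base R. The data frame `X`, its column names, and all the values in it are hypothetical stand-ins for the lecture's grain-size file; only the fact that the grain size sits in the fifth column comes from the lecture.

```r
# Hypothetical stand-in for the lecture's data: grain size in column 5.
# The real file is not part of the transcript, so these values are made up.
set.seed(1)
X <- data.frame(V1 = 1:60, V2 = 0, V3 = 0, V4 = 0,
                V5 = round(c(rnorm(45, mean = 14, sd = 2),
                             rnorm(15, mean = 22, sd = 2)), 1))

stem(X[, 5])                                 # stem-and-leaf plot, printed as text
dotchart(X[, 5], xlab = "ASTM grain size")   # dot chart of the same column
plot(ecdf(X[, 5]), xlab = "ASTM grain size", ylab = "ECDF")

# Hand-written version, following the spoken recipe:
# sort, take the cumulative sum, divide by the final value.
x <- sort(X[, 5], decreasing = FALSE)
y <- cumsum(x)
y <- y / y[length(y)]                        # normalization: last value becomes 1
plot(x, y, type = "s",
     xlab = "ASTM grain size", ylab = "Normalized CDF")

# For comparison: step heights i / n, which is what ecdf() assigns
# to the i-th sorted point when there are no ties.
y_ecdf <- seq_along(x) / length(x)
lines(x, y_ecdf, type = "s", lty = 2)
```

Note that `cumsum(x) / sum(x)` weights each point by its value, while `ecdf()` gives every point the same weight `1 / n`. For positive data with a moderate spread the two curves look similar, but they are not identical, which is why the dashed curve is overlaid for comparison.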
Let us look at it. First I take the fifth column and call it x, then sort x in increasing order (because decreasing is false) and store it back in x. Then y is the cumulative sum of x, and we divide by its final value so that the numbers are normalized and end at 1. That is the normalization step. Then we plot the cumulative function against the grain size, using the step type for plotting; the x label is ASTM grain size and the y label is normalized CDF. So you get this. It is essentially the same kind of figure as the previous one, except that now we have written the code ourselves. R is both a software package and a programming language, so you can either call a function or write your own code to do the same thing. We have done this once earlier too; this is just to remind you how a CDF is generated.
Of course one can use ggplot. We learned yesterday that with ggplot it is easier to change the y scale, and for that we also have to load the scales library. In ggplot you have to say which is the data and which are the aesthetics: we want to plot the ASTM grain size. And what is the plot? It is an ECDF plot, so you ask for the statistical transformation for the cumulative distribution, and the y scale should be a probability scale. If the data actually follow a normal distribution, the plot will look like a straight line on this scale, so by looking at it you can tell whether the data is normally distributed. That is the reason we use this scale: to see whether the plot is straight or shows any deviation. Let us do that. We see that here also there is a deviation, so it is not a straight line, and you do not expect the grain size distribution to be normal. There is also a warning message: the transformation introduced infinite values in the y axis.
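The ggplot code used in the session is not in the transcript. As a substitute, here is a base R sketch of the same idea: put the ECDF on a normal probability scale by hand and check whether it is straight. This is not the lecturer's ggplot call. The vector `grain` is made-up stand-in data, and the dashed reference line is an addition for illustration.

```r
# Hypothetical stand-in data (the lecture's grain-size file is not in the transcript).
set.seed(1)
grain <- c(rnorm(45, mean = 14, sd = 2), rnorm(15, mean = 22, sd = 2))

x <- sort(grain)
p <- seq_along(x) / length(x)   # ECDF heights 1/n, 2/n, ..., 1
z <- qnorm(p)                   # normal probability scale

# The top ECDF value is exactly 1 and qnorm(1) is Inf. This is the kind of
# infinite value a probability-scale transformation warns about.
ok <- is.finite(z)
plot(x[ok], z[ok], type = "s",
     xlab = "ASTM grain size", ylab = "Normal quantile of ECDF")

# For normal data, z = (x - mean) / sd, i.e. a straight line with
# intercept -mean/sd and slope 1/sd. Deviations from it indicate non-normality.
abline(a = -mean(x) / sd(x), b = 1 / sd(x), lty = 2)
```

Only the end point is lost when the infinite value is dropped, which is consistent with treating the warning as harmless for judging the shape of the curve.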
But that is not crucial, so we do not worry about it. Now let us plot a histogram, which is easy. This is the histogram, and it shows what we saw in the stem-and-leaf plot: there is a peak at this value, then it goes down, and then there is a smaller peak here. In the histogram this second peak is not quite a peak; it is rather a long and very fat tail. If this is the distribution, then this is a much larger, fatter tail. This is very common: data sometimes does not show a nice bell-shaped curve, and here is an example.
After the histogram, in the previous session we did the box-and-whisker plot, so let us do that; we say box plot. We have the box plot, and as usual we can also make it with horizontal set to true. You can see that this line is the median, and the box represents the second and third quartiles. To make this clear, let us run the quantile command, which shows what is happening. 50% of the data lies below about 15.1; the 25% of the data between the 25% and 50% marks lies between 13.2 and 15.1; and by 18.6 we have reached 75%. So the data from 25% to 75% lies in the box, with the first quartile on one side and the last quartile on the other. That is what the box plot represents, and it is another way of looking at the spread of the data.
So we have exhausted the rank-based reports that one can prepare, and we have even looked at one of the summary values. Of course we can get the other ones too.
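A minimal sketch of the histogram, box plot, and quantile steps in base R follows. The vector `grain` is made-up stand-in data, so the numbers it prints will not match the 13.2, 15.1, and 18.6 quoted in the lecture.

```r
# Hypothetical stand-in data (the lecture's grain-size file is not in the transcript).
set.seed(1)
grain <- c(rnorm(45, mean = 14, sd = 2), rnorm(15, mean = 22, sd = 2))

hist(grain, xlab = "ASTM grain size", main = "Histogram of grain size")
boxplot(grain, horizontal = TRUE, xlab = "ASTM grain size")

# Default quantile() call: 0%, 25%, 50%, 75%, 100%.
q <- quantile(grain)
print(q)

# Five numbers the box plot is drawn from:
# lower whisker, lower hinge, median, upper hinge, upper whisker.
b <- boxplot.stats(grain)$stats

# Fraction of points between the 25% and 75% quantiles: about one half.
inside <- mean(grain >= q[["25%"]] & grain <= q[["75%"]])
```

The box edges are hinges, which R computes slightly differently from the default `quantile()` values, so the two can differ a little. The thick line inside the box is the median in both cases.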
The first one is the mean, which is 16.3. Let us look at the median: that is 15.1, which is where the dark line in the box plot is; that line represents the median. Then the variance is 15.5, and the standard deviation is about 3.9.
Of course we want to plot these numbers along with the scatter plot to get a better idea, so let us do that. What are we doing? We are plotting the data as a scatter plot and then drawing lines for the mean, the median, mean plus and minus one standard deviation, and mean plus and minus two standard deviations. You can see that the mean is here at 16.3 and the median is at 15.1. The green lines mark the band within one standard deviation of the mean, and the blue lines mark the band within two standard deviations. The lower blue line you cannot even see here, so on that side all the points lie between the mean and the mean minus two standard deviations. But on the other side you can see a large number of data points lying just outside the 2 sigma line, so these are basically the points that are outliers.
So, to summarize. We have already looked at the quantiles. The first thing we did was plot the data, and for that we used a gap on the x axis. But you do not have to do that, because a dot chart, for example, shows the same thing without introducing any gap. That works because the dot chart only shows the numbers and does not really put in the grain IDs. If you need the grain ID on one axis and the numbers on the other, then you have to use the gap plot; otherwise the dot chart gives you the complete data in one go.
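The summary values and the annotated scatter plot can be sketched in base R as follows. The vector `grain` is made-up stand-in data. The green and blue colours follow the lecture, while the colours and line types for the mean and median are assumptions.

```r
# Hypothetical stand-in data (the lecture's grain-size file is not in the transcript).
set.seed(1)
grain <- c(rnorm(45, mean = 14, sd = 2), rnorm(15, mean = 22, sd = 2))

m  <- mean(grain)
md <- median(grain)
v  <- var(grain)
s  <- sd(grain)                 # equals sqrt(v)

# Scatter plot of the data against its index, with summary lines on top.
plot(grain, xlab = "Grain ID", ylab = "ASTM grain size")
abline(h = m,  col = "red")                   # mean (colour is an assumption)
abline(h = md, col = "black", lty = 2)        # median (style is an assumption)
abline(h = m + c(-1, 1) * s, col = "green")   # mean +/- 1 standard deviation
abline(h = m + c(-2, 2) * s, col = "blue")    # mean +/- 2 standard deviations

# Points more than 2 standard deviations from the mean.
outside <- grain[abs(grain - m) > 2 * s]
```

As a check on the lecture's numbers, the square root of the quoted variance 15.5 is about 3.94, which agrees with the quoted standard deviation of 3.9.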
So these are ways of looking at the data. Then we made several rank-based reports and represented them graphically: the CDF, the histogram, the box plot, and so on. Then we made the summary-based reports: mean, median, variance, standard deviation, quantiles, etc. Finally we plotted all the data points together with these summary numbers on the same plot, to get an idea of the spread of the data and of the outliers. This completes the descriptive data analysis for data set one. Now let us take the more complicated data set two, which is for two different phases, do the same analysis, and see what it has to tell us. Thank you.