course on dealing with materials data. In the previous session, we were going through the subject of descriptive statistics: how to describe your data either numerically or graphically. We first looked at the objective of descriptive statistics. Then we said there are two approaches, numerical and graphical, and in the last session we covered the numerical methods, namely the measures of central tendency and the measures of dispersion.
Today we are going to cover the graphical methods, the relationship between mean, median and mode, and the correlation coefficient, which applies when you have a data set with two variables.
You are all very familiar with graphical methods; you see them more and more often in newspapers these days, with data journalism becoming very popular. We have the histogram or bar chart, which is also called a frequency plot. When the data fall into categories, they can also be shown as a pie chart, in which you show what percentage of the data comes from each category. Then we have the cumulative frequency plot, which is closely related to the frequency plot. And the last is the box and whisker plot. It is actually a box with two whiskers, and that is why it is called a box and whisker plot; it is not named after any statistician or scientist.
As an example, I have taken data on the mechanical properties, the strength properties, of a superalloy. Here you can see that there are a couple of codes attached to the data, there is a temperature attached to the data, this is the yield strength, this is the ultimate tensile strength, this is the elongation and this is the reduction of area; the last two are in percentage.
Now, if I want to plot a histogram from this data, I find the lowest value and the highest value of the data and divide that range into a number of bins. For choosing the number of bins there are no written rules, but there are some rules of thumb. Here the data set is fairly large, 311 data points, and therefore we have divided it into 10 bins, and you see these values: 875 MPa, 924.5 MPa, and so on. Each of these is the yield strength at the midpoint of a bin. Here is the count, the number of data points that fall in the bin, and here is the cumulative frequency: this is 15 out of 311, then 15 plus 15, that is 30 out of 311, and so on. It is the cumulative percentage of data that fall up to and including this particular bin.
And here is the plot. This plot is called a histogram: at every bin value it shows a bar whose length corresponds to the frequency of the data. The red curve is the cumulative frequency curve. The y axis on the right side shows the cumulative percentage, and the y axis on the left side gives the frequency.
A histogram gives us a lot of information. For example, when I make a similar plot for the ultimate tensile strength, I find that the data are bimodal. You remember the definition of the mode: the mode of a distribution is the value that occurs with the highest frequency. You can see that 1166.5 MPa has the highest frequency here, while here it is 1404.4 MPa. So you find that this is bimodal data, and it can be shown that this happens because there are two temperatures. If you look back, the data have temperature values of 25 degrees, which is room temperature, and 650 degrees, which is taken as the high temperature, and therefore you see two peaks: one belongs to the room temperature and the other to the high temperature.
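
The binning procedure described above (find the lowest and highest value, split the range into equal bins, count the points in each bin, and accumulate the counts into a percentage) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `histogram_table` and the ten sample values are hypothetical and are not the 311-point superalloy data set from the lecture.

```python
def histogram_table(values, n_bins=10):
    """Return (bin midpoint, count, cumulative %) rows for equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; cannot form bins")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin instead of a new one.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    rows = []
    running = 0
    for i, count in enumerate(counts):
        running += count  # cumulative frequency up to and including this bin
        midpoint = lo + (i + 0.5) * width
        rows.append((midpoint, count, 100.0 * running / len(values)))
    return rows


# Hypothetical yield-strength values in MPa, for illustration only.
sample = [850, 900, 900, 950, 1000, 1000, 1000, 1050, 1100, 1150]
for midpoint, count, cum_pct in histogram_table(sample, n_bins=3):
    print(f"{midpoint:8.1f} MPa  count={count:3d}  cumulative={cum_pct:5.1f}%")
```

The count column is what the bars of the histogram show, and the cumulative percentage column is what the cumulative frequency curve shows.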
So, what I am trying to say is that a histogram gives out a lot of information about the data at the very beginning, so that you know how to deal with the data when you do the actual statistical analysis.
Here is a pie chart. As I said, there were some codes defined, codes 1, 4, 3, 5, and here it shows how much of the data belongs to each code value. One code value is defined as per the ASTM number for the grain size, and therefore the data are divided in this manner. So, this is a pie chart.
Next is the box and whisker plot. This is one plot which gives away a lot of information. As the name says, there is a box, and there are whiskers attached to it. You see, these are the whiskers and this is the box. The box is made in this way: it spans the interquartile range. If we understand that this is the minimum and this is the maximum value, then this is Q1. Remember what the interquartile range is? It is Q3 minus Q1. Q1 says that 25 percent of the data is below Q1, and Q3 says that 75 percent of the data is below Q3, or 25 percent of the data is above Q3. Please remember that I have drawn this from the smallest value to the largest value, and therefore below and above look reversed in this particular figure. The plot also has a line which shows the median value of the data. Here is the interquartile range.
Then, the length of the whiskers varies. Many times the whiskers extend to the maximum value here and the minimum value here. But it all depends on the software and the approach; the whisker length is also often taken as k times the interquartile range on both sides.
Here I am showing you box and whisker plots which I have obtained for the chemical analysis of aluminum in an alloy, and I have three laboratories where the chemical analysis of this alloy has been done.
One is Mishra Dhatu Nigam, the other is DMRL and the third is NFTDC, and this shows the results obtained in each case. When the whisker has short horizontal lines at the top and bottom, as you can see here, it means that it extends to the maximum and minimum values. Here, of course, it goes from the minimum at the bottom to the maximum at the top, so it is upside down compared with what I described earlier; please remember that. Here it shows that the values range from somewhere around 5.5 to somewhere around 5.7, while in the case of DMRL the data have a very wide range. This line shows the median. Why have I shown this? It also gives you a way to compare data from three different sources. The same data have been collected from three different sources, and this plot puts the three side by side.
Now we come back to the measures of central tendency, to something that we had left out. Why? Because this relationship is best shown in terms of a frequency plot. You recall that if you join these points together, you get a curve; this is called a frequency plot. Here I have shown a nice bell-shaped frequency plot. This is data distributed in perfect symmetry, as happens in the normal distribution. When data are distributed in perfect symmetry, mean is equal to median is equal to mode; all three values lie at the same point. The mode is very easy to find from a frequency plot: it is the value at which the highest frequency occurs. And here you can see that the mean, median and mode are the same if the data are in perfect symmetry. What happens if the data is skewed?
Now you see this long tail. When a frequency plot has a long tail on the right side, the data are called positively skewed. A long tail on the right side, instead of a symmetric shape, means that the frequency of data on the right side is higher than it would have been in a perfectly symmetric curve, and therefore the mean gets shifted towards the right. The median remains in the center because it divides the data into two halves; please remember, the median divides the data into two halves. So 50% of the data is on this side, and 50% of the data is on that side. The mode always remains at the highest frequency point, so the mode is very easy to find. The mode is here, the median is here and the mean is here. Therefore, in a positively skewed distribution the relationship becomes: mode is less than median, which is less than mean. This relationship helps when you compute the mode, median and mean in descriptive statistics and find that they are strikingly different: if they are in this order, you already have an idea that your data are positively skewed.
On the other hand, if the long tail is on the left side, the data are called negatively skewed. Again, because the frequency of data on the left side has increased, the mean has moved to the left of the center, while the median remains at the center. So this side of the data has probability 0.5, or 50%, and that side of the data also has probability P equal to 0.5, because that is how the median is defined. The mode is always at the highest frequency. Therefore, when the data are negatively skewed, you will find that mean is smaller than median, which is smaller than mode.
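
As a small sketch of how this ordering can be used in practice, the following Python snippet computes the mean, median and mode of a sample and reports which of the orderings described above it matches. The function name `skew_direction` and the three sample lists are made up for illustration. The ordering is a quick indication only, not a formal skewness test, so the function returns "inconclusive" when none of the orderings applies.

```python
import math
import statistics


def skew_direction(values):
    """Classify a sample by the ordering of its mean, median and mode."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    mode = statistics.mode(values)  # most frequent value in the sample
    if math.isclose(mean, median) and math.isclose(median, mode):
        return "symmetric"      # mean = median = mode
    if mode < median < mean:
        return "positive"       # long tail on the right
    if mean < median < mode:
        return "negative"       # long tail on the left
    return "inconclusive"       # the ordering gives no clear hint


# Made-up samples, for illustration only.
right_tailed = [2, 3, 3, 3, 4, 4, 5, 6, 9, 14]
left_tailed = [1, 6, 9, 10, 11, 11, 12, 12, 12, 13]
balanced = [1, 2, 2, 3, 3, 3, 4, 4, 5]

for name, data in [("right_tailed", right_tailed),
                   ("left_tailed", left_tailed),
                   ("balanced", balanced)]:
    print(name, skew_direction(data))
```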
So once again, if we come to know from the descriptive statistics that there is a large difference, and the relation is mean smaller than median smaller than mode, then you have an idea that you are going to deal with a negatively skewed distribution. You can confirm this further by plotting a histogram and looking at how skewed it is. So, to go through it quickly: if the curve is distributed in perfect symmetry, mean is equal to median is equal to mode; if it is positively skewed, mode is smaller than median, which is smaller than mean; and if it is negatively skewed, mean is smaller than median, which is smaller than mode.
Next, we would like to define something called the correlation coefficient. So far we were dealing with one data set. Now suppose we have two data sets: x1, x2, x3, ..., xn and y1, y2, y3, ..., yn. You know what the variance of x is: it is the variation within x with respect to its mean value x bar. Similarly, the variance of y shows the variation within y with respect to its mean value y bar. Covariance describes how the two vary together, and therefore the covariance is defined as the summation of (xi minus x bar) multiplied by (yi minus y bar). As can be seen, this value is not necessarily positive; the covariance can be negative, positive or zero. The correlation coefficient is then defined as the covariance of x and y divided by the square root of the product of the variance of x and the variance of y. Covariance has a unit. The correlation coefficient, as the name says, is a coefficient; it is a unitless quantity.
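
Written out in symbols, the definitions just given read roughly as follows. The symbol r for the correlation coefficient is a notational assumption, and the variances are written in the same unnormalized form as the covariance in the lecture. If a common factor such as 1/n or 1/(n-1) is applied to all three quantities, it cancels in the ratio, so r is unaffected.

```latex
\operatorname{cov}(x, y) = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}),
\qquad
\operatorname{var}(x) = \sum_{i=1}^{n} (x_i - \bar{x})^2,
\qquad
\operatorname{var}(y) = \sum_{i=1}^{n} (y_i - \bar{y})^2

r = \frac{\operatorname{cov}(x, y)}{\sqrt{\operatorname{var}(x)\,\operatorname{var}(y)}}
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}}
```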
Covariance has a unit, which is the unit of x multiplied by the unit of y. Remember that the unit of the variance of x is the square of the unit of x, and the unit of the variance of y is the square of the unit of y. Therefore, when you take the square root of their product, it becomes the unit of x multiplied by the unit of y, the same unit as the covariance, and therefore the correlation is a unitless quantity.
It can very easily be shown that the correlation lies between minus 1 and plus 1. It also expresses the amount of linear relationship between x and y; the word linear is very important, and I will talk about it a little later. If the correlation of x and y is minus 1, it implies a perfect linear relationship with a negative slope. So, if I draw a small picture here, with this as x and this as y, then the relationship will be a straight line like this, sloping downwards, when the correlation is exactly minus 1. When the correlation between x and y is exactly plus 1, the relationship will be a straight line with a positive slope. Any value in between indicates (there is a spelling mistake here; it should read "indicates") a somewhat imperfect linear relationship. For example, if we have a set of data which goes in this manner, you can see that there is a linearity in its trend, but it is not perfectly linear. If it were perfectly linear, the data would fall exactly on the line, but they do not. So there will be some approximate line going through the data, and that is why it is called an imperfect relationship.
When the correlation is 0, it implies that there is no linear relationship between x and y. This is very important: there is no linear relationship. That is why I noted earlier that the correlation measures a linear relationship, and this says there is no linear relationship.
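
The cases described above can be checked numerically. The following is a minimal sketch in plain Python; the function name `correlation` and all data values are hypothetical. It computes the coefficient from the sums of deviations, so any common normalizing factor would cancel, and evaluates it for a perfect line with positive slope, a perfect line with negative slope, a roughly linear scatter, and a symmetric quadratic curve as one case of a nonlinear relationship.

```python
import math


def correlation(xs, ys):
    """Correlation coefficient: cov(x, y) / sqrt(var(x) * var(y))."""
    if len(xs) != len(ys):
        raise ValueError("x and y must have the same number of points")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var_x = sum((x - x_bar) ** 2 for x in xs)
    var_y = sum((y - y_bar) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when x or y is constant")
    return cov / math.sqrt(var_x * var_y)


xs = [1, 2, 3, 4, 5, 6]
rising_line = [2 * x + 1 for x in xs]           # perfect line, positive slope
falling_line = [-3 * x + 5 for x in xs]         # perfect line, negative slope
noisy_trend = [1.2, 1.9, 3.2, 3.8, 5.1, 5.9]    # roughly linear, made-up values

sym_x = [-3, -2, -1, 0, 1, 2, 3]
parabola = [x ** 2 for x in sym_x]              # exact but nonlinear relationship

print(correlation(xs, rising_line))
print(correlation(xs, falling_line))
print(correlation(xs, noisy_trend))
print(correlation(sym_x, parabola))
```

The two straight lines give values of plus 1 and minus 1 up to floating-point rounding, the scatter gives a value strictly between 0 and 1, and the symmetric quadratic gives 0 even though y is completely determined by x.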
For example, if I take a perfect parabola, with this as x and this as y, you will find that the correlation between x and y is approximately 0. If you take data points placed exactly symmetrically on this parabola, you will find it becomes exactly 0. So it is very important to realize that the correlation coefficient explains only the linear relationship between x and y. If it is minus 1, it is a perfect line with negative slope; if it is plus 1, it is a perfect line with positive slope; anything in between is a somewhat imperfect relationship. If the imperfect relationship is like this, with an upward trend, your correlation coefficient will be greater than 0 but less than 1, and if your data are distributed somewhat like this, with a downward trend, the correlation is going to be less than 0 but, of course, greater than minus 1. So it gives you an idea of whether the relationship is on the positive side or the negative side, but it would not give you a perfect 1 or a perfect minus 1. And when it gives you a perfect 0, please understand that it only says that there is no linear relationship.
With this we complete the sections on descriptive statistics. Let us quickly go through them. First we had an introduction to the course, and then we had data description. We studied the numerical methods, in which we looked at the measures of central tendency, such as mean and median, and we also talked about which to choose when. We also studied the relationship between mean, median and mode, from which you can decide whether the data are positively skewed or negatively skewed. If the data are positively skewed, mode is smaller than median, which is smaller than mean; if the data are negatively skewed, mean is smaller than median, which is smaller than mode. Then we studied the measures of dispersion, such as the range, the standard deviation or variance, and we also learned about the interquartile range.
We talked a little about the moments of the data. Then we studied the graphical methods, the histogram, the pie chart and the box and whisker plot, and how they are useful in showing others what the data look like. Again, as a repetition, we saw the relationship between mean, median and mode for symmetric and asymmetric distributions, which we looked at using the histogram. And then finally we introduced the correlation coefficient. With this we conclude the chapter on descriptive statistics.