Welcome back. In today's lecture, we will be looking at exploratory data analysis. So far we have studied random variables, both discrete and continuous. We looked at some of the parameters that are encountered when dealing with both continuous and discrete random variables: the mean, variance, standard deviation and so on. We also looked at the moments. Then we came to normal probability distributions and one of their variants, namely the log normal probability distribution. Before we go further into the details of statistics, it is worthwhile to take a small break and look at the presentation of data. This is also very important to us. Let us say that you have conducted the experiments and you have the data available with you. It will be a good idea to subject the data to a preliminary analysis to get a feel for the data trends: between what range of values you find the data points, whether there are some unusual observations, and whether the data look linear or show a strong curvature when the response is plotted against the main variable or variables. You may also want to see whether the distribution of the data follows a certain standard distribution; for example, you may want to check whether the distribution is normal. You may also want to present the data, and ask what the different effective ways of presentation are. What will you actually look for in the data? These are some of the things we are going to discuss now. Of course, this discussion is not exhaustive or complete; it is just a starting point. There are many more data analysis procedures which you will come to understand when you read up after this lecture. Coming to the references, there is a nice book by DeCoursey, Statistics and Probability for Engineering Applications with Microsoft Excel. We are also going to follow our usual reference book, the one written by Montgomery and Runger. Also, we will be following Ogunnaike.
The name of the book is Random Phenomena. Coming to the motivation for exploratory data analysis: as the name implies, you have the data and you are going to do an exploratory analysis of it. Statistical analysis is based on data. What is the difference between statistics and mathematics? Both seem to use, for example, integration, differentiation and many more methods of analysis. Statistics is based on data, while mathematics is based on pure numbers. This kind of interpretation is of course arguable, but it is one way of looking at it. So when you have done the experiment, you must familiarize yourself with the trends of the experimental data set. As an experimentalist, you should also look for the center point of the data distribution. You may want to look at the average value and also quantify the spread of the distribution. In order to get a proper idea of the spread, you may want to first plot the data, and after plotting, you should be able to discern its essential features without too much textual description. You should also see whether there are any rogue data points in the data set you have collected, and it is better to become aware of these right at the outset. If you can detect the location of these outliers, you can perhaps repeat the experiments corresponding to these data points and see whether you still get the same values. If you instead wait until the end of the experimentation, these rogue data points may stick out like a sore thumb and you will be at a loss as to what to do with them; you would then have to attribute some reasons why they may have occurred. So the moral of the story is: if you have any outliers, you had better locate them right at the outset and take suitable action. You may want to speculate that the given data belong to a particular statistical distribution. This may be based on what other people have experienced with this type of data.
For example, particle size distributions are commonly expressed in terms of the log normal distribution. So you may want to assume that the particle diameters in your data set can be expressed in terms of ln d, where d is the particle diameter, and then you may want to show them in the form of a log normal distribution. But you have to confirm that the data indeed belong to that particular distribution. So you should be able to plot the data suitably and demonstrate whether this assumption is justified. A picture is worth a thousand words, and hence it is always better to present your data in a compact, economical and effective manner. In the corporate sector, most presentations involve data analysis and they have to be presented in a compact manner; a lot of information should be present in 1 or 2 diagrams. You do not want to show 10 or 20 graphs to drive home your point. Let us look at box plots. We came across box plots in the introduction session, the very first lecture, if you recall. The box plot is also called the box-and-whisker plot. In this plot you are able to show a lot of information: the data maximum and minimum along with other characteristics like the quartiles, the median, the outliers and so on. Box plots are economical, and a lot of information can be presented in a compact manner. If you have different sets of data and you want to compare them, box plots are quite useful. Let us look at the features of the box plot. I will show you a diagram of the box plot. Here we are comparing 2 sets of data. This is the so-called box, and these are referred to as whiskers. The first line is the first quartile, the second line is the second quartile and the third line is the third quartile. Quartile you may relate to quarter, or one fourth. So in this diagram we have shown the whiskers, the box, and the first, second and third quartiles.
Now let us go back and see the definitions of these quartiles. The 0th quartile is the data point with the lowest value. For example, in a class where the marks are distributed, the teacher may want to arrange the papers from the lowest mark to the highest mark. Usually this is not done and the papers are distributed in a random manner, but some teachers want to present the papers in the ascending order of marks. So coming to the 0th quartile, it is the data point with the lowest value. The first quartile refers to the value below or equal to which 25% of the data are present and above which 75% of the data are present. The second quartile refers to the data point below or equal to which 50% of the data are present and above which obviously the remaining 50% are present; the second quartile is also equal to the median. The third quartile, by now you will be familiar with it: 75% of the data are located equal to or below this value and 25% of the data are located above this value. Similarly, the fourth quartile is the data point with the highest value. The interquartile range is the difference between the third and the first quartiles, that is Q3 - Q1, where Q3 is the third quartile and Q1 is the first quartile. What about the whiskers? If you recollect, I said that these vertical lines shooting out of the boxes on either side are termed whiskers. The whiskers are drawn from the edge of the box to the farthest data point that is located within 1.5 times the interquartile range. So you have the first quartile and the third quartile, and then you want to identify the data points that are located within 1.5 times the interquartile range from these two quartiles. The whiskers need not be of equal length. So this value is the one which is lying within 1.5 times the interquartile range; these two distances are not the same, but the data point at the very end of each whisker falls within 1.5 times the interquartile range.
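The quartile, interquartile range and whisker definitions above can be tried out on a small made-up data set. Here is a minimal sketch in Python, assuming the "inclusive" interpolation convention of the standard library's statistics module; different packages, including Minitab, use slightly different quartile conventions, so the exact cut points can differ a little between tools.

```python
import statistics

data = [12, 15, 17, 19, 20, 22, 23, 25, 28, 30, 55]  # 55 is a suspect point

# Quartiles Q1, Q2 (median), Q3 under one common interpolation convention
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

# Interquartile range
iqr = q3 - q1

# Fences at 1.5 * IQR beyond Q1 and Q3; whiskers extend to the farthest
# data points lying inside these fences
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences is flagged as a potential outlier
outliers = [x for x in data if x < lower_fence or x > upper_fence]
```

With this data set, Q1 = 18.0, Q2 = 22.0, Q3 = 26.5, the IQR is 8.5, and only the value 55 falls beyond the upper fence, so it would be drawn as an outlier point beyond the whisker.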
Similarly, here also you have whiskers, but it appears that the data point at the edge of this whisker is lying at the same distance from the third quartile as that point was lying from the first quartile. So it depends upon the data set. We already discussed the performance of the students in the lab and in the course. The lab is more of a group activity, and so the marks are closer to each other when compared to the core course performance. If there are any data points below or above the respective whiskers, they are referred to as outliers. What we are trying to do here is this: whatever data point falls within the quartiles, or close to them, is considered an expected point, and any point which lies beyond 1.5 times the interquartile range from the first quartile or the third quartile is considered a rogue point or an outlier. This box plot can be generated in different ways; I have used version 16 of Minitab to generate this box plot diagram. Now let us look at another kind of plot, namely the scatter plot. It shows the scatter in the data, to put it simply. It shows 2 data sets on a regular graph sheet and compares them. You may want to plot the first data set along the x axis and the second data set along the y axis and then see if there is any correspondence between the 2. We want to see whether there is a dependency between the 2 data sets. For making a comparison, of course, the lengths of the 2 data sets should be the same: if the first data set has 20 points, the second data set should also have 20 points. I will demonstrate the scatter plot with the help of an example. Let us say that we are looking at a batsman's performance over the years and we want to show the runs scored in a calendar year as a function of the batsman's age: whether the batsman is getting better with age, or getting worse, or going through an optimum phase before beginning to fade out.
Well, when we look at the scatter plot for this particular batsman, the runs scored per calendar year are shown on the y axis and his age is shown on the x axis. On this diagram it can be seen that there is no apparent relation or dependency between the runs scored per year and the age. The runs scored per year may have fluctuated for other reasons; the fluctuations might not have been due to the aging of the batsman. So this clearly shows that there is no dependency of the runs scored on the age of the batsman in the range of 20 years to 30 years. Normally, when you do an experiment and collect the data, it is referred to as a sample. So we would like to look at the sample properties, and the most common one would be the sample mean. The sample mean is denoted by x bar and the sample variance is denoted by s squared. We have earlier seen the mean being represented by mu and the variance being represented by sigma squared; for example, in the normal distribution the mean was given as mu and the variance was given as sigma squared. But remember, here we are talking about the sample; earlier we were talking about the population. The population parameters were given in terms of mu and sigma squared for the mean and variance respectively. Here we are talking about the sample and we denote them by x bar and s squared. The sample mean is the arithmetic mean: we just sum all the data point values and divide by the number of data points. The s squared is the sample variance, which is defined in terms of the deviations from the mean: the deviation of each sample data point from the mean is squared, the squares are added, and after that we divide by n - 1. So what we saw was, for a discrete data set, the mean is a measure of central tendency.
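A simple numerical companion to the scatter plot is the sample (Pearson) correlation coefficient between the two data sets; a value near 0 is consistent with the "no apparent dependency" seen in the plot. Here is a sketch in Python using hypothetical runs-versus-age numbers; the data are invented for illustration, not the batsman's actual figures.

```python
ages = list(range(21, 31))                                  # ages 21 to 30
runs = [820, 640, 910, 700, 560, 880, 730, 600, 840, 670]   # hypothetical runs per year

n = len(ages)
mean_age = sum(ages) / n
mean_runs = sum(runs) / n

# Sample covariance and standard deviations (n - 1 divisor)
cov = sum((a - mean_age) * (y - mean_runs) for a, y in zip(ages, runs)) / (n - 1)
s_age = (sum((a - mean_age) ** 2 for a in ages) / (n - 1)) ** 0.5
s_runs = (sum((y - mean_runs) ** 2 for y in runs) / (n - 1)) ** 0.5

# Pearson correlation coefficient: values near 0 suggest no linear dependency
r = cov / (s_age * s_runs)
```

For this invented data set, r comes out small in magnitude, matching the visual impression that the runs fluctuate for reasons unrelated to age.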
For a discrete distribution the mean is also referred to as the expected value of x, and that is given by mu = E(x) = the sum over i from 1 to N of xi f(xi). In the case of continuous distributions, we have mu = E(x) = the integral from minus infinity to plus infinity of x f(x) dx. So looking at the properties of the data set, for a discrete distribution we say that the mean mu is the expected value of x, given by the sum of xi f(xi). We are also defining the arithmetic mean as the sum of xi divided by n. So are we having two different definitions? What then is the basis for the arithmetic mean? It is actually quite simple. If each of the xi values has an identical probability of occurrence, then f(xi) will simply be 1/N, which is independent of the index i. So mu = the sum of xi f(xi) becomes the sum of xi divided by N, where capital N is the total number of entities in the population. Now we do the same thing for the sample mean. When you have a sample, we use the same formula, but a sample is a subset of the population; the number of entities in the sample will be much smaller than in the population. We may not find it practical to take the data from each and every entity in the population, so we take a representative sample from the population and get the important characteristics from it. So when we take the mean of the sample, it is denoted not by mu but by x bar, and that is given by the sum over i from 1 to n of xi, divided by n. Here n is small n; it should not be confused with capital N. Capital N is the overall number of entities in a population, and it may even run into lakhs or millions. So the population can be huge, but a sample is usually of the order of, let us say, 30. It can even be as low as 5, and it can go up to 30 or 40; the sample need not be larger than that. So the sample mean is defined as x bar = the sum over i from 1 to n of xi, divided by n.
We did have f(xi), but since the probability of occurrence of each item in the sample was identical, it became 1/n, and so we have x bar = the sum over i from 1 to n of xi, divided by n. This is the most natural way we take the average of a finite data set. You have other definitions such as the geometric mean and the harmonic mean; however, these are not as commonly used as the arithmetic mean. The arithmetic mean balances the extent of the deviations, both positive and negative, of the data points from itself. In fact, the mean is located in such a way that the positive deviations from it balance out the negative deviations, so that the total sum of these deviations is equal to 0. The problem with this kind of definition is that the presence of an unusually large value or an unusually small value may influence the average value. The average is a measure of the overall data set. Let us say that a batsman is playing a 3 test series and he has scored 200 runs in 4 innings; the average may look to be a healthy 50. But if he has scored 200 in the first innings and then scored ducks in the remaining 3 innings, then the average of 50 is not a good representation of his performance: he has performed very well in the first innings and done nothing in the remaining 3. So when you have a small data set with extreme values, the mean value may be influenced by the presence of these extreme numbers. The median is also a measure of the central value in the distribution, and we saw in the box plot discussion that the median is the second quartile. How do we find the median? It depends upon whether the data set has an odd number or an even number of points. You arrange the data points in ascending order: you put the smallest number first and the largest number last, and then you find the median.
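The batsman example can be checked in a couple of lines of Python: the sample mean looks healthy, while the median exposes the three ducks.

```python
import statistics

scores = [200, 0, 0, 0]                 # one big innings, three ducks

mean = sum(scores) / len(scores)        # arithmetic mean: a healthy-looking 50.0
median = statistics.median(scores)      # median: 0.0, which tells the real story
```

The single extreme value of 200 pulls the mean all the way up to 50, while the median, being based only on ranking, stays at 0.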
If the number of data points in the sample is an odd number, then the calculation of the median is quite simple: we identify the data point which is in the middle of the spread. Suppose you have 2M + 1 data points arranged in ascending order; the median will correspond to the (M + 1)th data point. When, on the other hand, you have an even number of data points, say 2M, again arranged in ascending order, the median will correspond to the average of the Mth and the (M + 1)th data points. Suppose you have, let us say, 4 numbers arranged in ascending order; then you take the second number, which is the Mth, and the third number, which is the (M + 1)th, and take the average of those 2 numbers to get the median. The median involves only a ranking, and the presence of unusually small or large data points will not affect the median value; hence it is considered more robust in estimating the central tendency of a distribution, as it is not affected that much by outliers. The extreme points may take any value, but here you are not actually adding and then dividing by the total number, so the values of these numbers do not really affect the calculation; you are just ranking them and then seeing which number lies in the middle. The outliers are obviously extreme data points: very low data values or very high ones. You would not get outliers in the middle of a distribution; that does not make any sense. You will have outliers only at the extremes of the distribution. So in that sense the median is a somewhat more robust way to find the central tendency of the distribution. If the numbers are highly asymmetrical, with many values considerably different from the mean, then the median is preferred. We also have the mode. The mode, by definition, is the number which appears most frequently in the data set.
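The ranking rule just described, the (M + 1)th point for 2M + 1 values and the average of the Mth and (M + 1)th points for 2M values, can be written directly in Python:

```python
def median(xs):
    s = sorted(xs)                      # arrange in ascending order
    n = len(s)
    m = n // 2
    if n % 2 == 1:                      # 2M + 1 points: take the (M + 1)th
        return s[m]
    return (s[m - 1] + s[m]) / 2        # 2M points: average the Mth and (M + 1)th

print(median([7, 1, 9]))                # odd count: middle value is 7
print(median([7, 1, 9, 4]))             # even count: (4 + 7) / 2 = 5.5
print(median([7, 1, 900, 4]))           # the extreme 900 leaves the median unchanged
```

Note how replacing the largest value 9 with 900 in the last call does not move the median at all, which is exactly the robustness property described above.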
This you might have studied in high school itself. In a discrete collection of data, the mode is the most popular value; DeCoursey even terms it the most fashionable item in the data set. Who said numbers are dull? They have very interesting properties. So now we want to look at the spread of the data. We have looked at the central tendency; now we will look at the variability in the data. The mean and median give an estimate of the number that is located at the center of the distribution. However, they do not indicate how the other data points are clustered around the center point: whether the points are very close to the mean value or wide apart from it. It is very important for us to know the scatter about the mean value; it is as important as knowing the mean value itself. And the variability in the data is what influences us when we make decisions during experiments. The variance is the parameter which influences our decision making in statistical data analysis. What is variance? Variance is based on the deviations from the mean. But we know that the deviations from the mean add up to 0. Our aim is not to get the actual values; we want to get an overall idea about the spread. So whether it is a negative deviation or a positive deviation, we want to give them equal importance, and so we square these deviations. Once they are squared, there is no difficulty, because the sum will not be equal to 0 in most cases. So we take the squares of the deviations from the mean, add up those squared deviations and then divide by a suitable number; that suitable number we will discuss very soon. The next measure of the spread is the sample range, and it is defined as the difference between the largest and smallest values in the data set. This is a kind of shortcut to estimate the spread of the data: you are using only 2 data points and not considering the entire data set.
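Both the mode and the sample range are one-liners in Python; here is a short sketch on an invented data set:

```python
from collections import Counter

data = [3, 5, 5, 2, 5, 7, 3]

# Mode: the most frequently occurring ("most fashionable") value
mode = Counter(data).most_common(1)[0][0]    # 5 appears three times

# Sample range: largest minus smallest, a quick but rough measure of spread
sample_range = max(data) - min(data)         # 7 - 2 = 5
```

The range here uses only the two extreme values, 7 and 2; as noted above, the rest of the data set does not participate at all.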
Well, if you want to look at the spread, it will be better if all the data points participate in the exercise. If you take only the smallest number and the largest number and find the difference, that is usually not rigorous. It is a useful and quick estimate, but not a very rigorous one. For example, the largest and the smallest data points may be outliers, while the other remaining data points may be very close to the mean value. If you go by only the largest and smallest values, you may be overestimating the spread, whereas in the actual case the overall spread may be quite a bit smaller than what is reported by the sample range. The range is useful when you want to compare different data sets of equal sizes. DeCoursey observes that when the size of the data set increases, the range also tends to increase along with it. In some research papers, you might have come across the average absolute deviation. As the name says, there are different ways to handle the case where you have both positive and negative deviations. One way is to square them, but when you square them, you are sort of changing the order of magnitude of the number: if it is greater than 1, squaring gives a higher order of magnitude, and if the number is less than 1, you get a lower order of magnitude after squaring. This is a slight manipulation of the data. Of course, after you take the variance, you take the square root and get the standard deviation. Another way to handle this issue of positive and negative deviations from the mean is to ignore the sign. So what we do here is take the absolute value of the deviation from the mean. We write it as d bar, which is equal to 1/n times the sum over i from 1 to n of the absolute value of di, where di is the deviation of xi from the arithmetic mean.
As far as the average absolute deviation is concerned, the presence of a large valued outlier can cause this estimate to also be affected. When you want to present the deviations between your model predictions and experimental data, you may want to do so using the average absolute deviation. It may so happen that, except in one case, the data match rather well with the model predictions. However, because of that one outlier, the model may show a prediction much different from the experimentally observed value, and this may inflate your average absolute deviation and make it appear as if the comparison between the model and the experimental data is not that good. So you may have to check for this outlier; of course, you have to get to the root of the matter rather than simply removing the outlier. The average absolute deviation is a simpler alternative to the standard deviation. So now we will again be talking about the average absolute deviation, but this time with respect to the median value; earlier we were talking with respect to the arithmetic mean. Now we are going to find the deviation with respect to the median. A small typo is there on the slide: the subscript has again not been implemented, so I will just make it a subscript. The average of the absolute deviations from the median is more robust than that based on the mean. It is pretty useful, and we denote the average absolute deviation from the median by D bar m. Now let us come to the most rigorous way of identifying the spread in the data. Here we find the sum of squares of the deviations from the mean and divide by n - 1, where n is the number of data points in the data set or sample. This is called the sample variance, and it is very popular. Please note that the variance is always a positive quantity, because all the deviations have been squared and all of them have become positive.
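Both flavours of the average absolute deviation, about the mean (d bar) and about the median (D bar m), can be computed as follows. The data are invented, with 47.0 playing the role of the outlier; notice that the median-based version is less inflated by it.

```python
import statistics

data = [10.0, 12.0, 11.0, 13.0, 9.0, 47.0]   # 47.0 is the outlier

# Average absolute deviation about the mean (d bar)
mean = sum(data) / len(data)                              # 17.0
aad_mean = sum(abs(x - mean) for x in data) / len(data)   # 10.0

# Average absolute deviation about the median (D bar m)
med = statistics.median(data)                             # 11.5
aad_median = sum(abs(x - med) for x in data) / len(data)  # 7.0
```

The outlier drags the mean itself away from the bulk of the data, so every deviation about the mean grows; the median stays put among the well-behaved points, and its version of the average absolute deviation comes out smaller.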
Of course, we deal only with real numbers; we do not deal with imaginary quantities, so after squaring we always have positive values with us. The mathematical formula for the variance is s squared = the sum over i from 1 to n of (xi - x bar) squared, divided by n - 1. To find the standard deviation s, we simply take the square root of the variance. The quantity (xi - x bar) squared is referred to as the squared deviation, and since we are adding all those squared deviations and dividing by n - 1, we have a mean square deviation, of which we then take the square root. So the standard deviation is also referred to as the root mean square deviation from the mean. This concept of the mean square is very important and we will encounter it frequently in our design and analysis of experiments; we call it the mean square error. You may want to refer to the first lecture, the introduction, where we talked about the mean square error for the fertilizer example. Please note that the mean can have a negative value; it depends upon the range of the numbers, as I have already mentioned in one of the earlier lectures. The standard deviation has the same units as the data set. If your data set consists of particle diameters, expressed appropriately in micrometers, then the standard deviation will also be in micrometers, and the mean of the distribution will also be in micrometers. So the standard deviation and the mean have the same units. The mean, however, can take negative values, whereas the standard deviation is the positive square root of the variance, and so the standard deviation takes only positive values. In the formula we used to find the variance or the standard deviation, we used n - 1. Why did we use n - 1? Why not n?
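The formula just stated translates directly into Python; the standard library's statistics.variance uses the same n - 1 divisor, so it serves as a cross-check on the hand-rolled version.

```python
import math
import statistics

data = [1.0, 2.0, 3.0, 4.0, 5.0]
n = len(data)
xbar = sum(data) / n                                  # sample mean: 3.0

# Sum of squared deviations from the mean, divided by n - 1
s2 = sum((x - xbar) ** 2 for x in data) / (n - 1)     # sample variance: 2.5

# Standard deviation: the root mean square deviation from the mean
s = math.sqrt(s2)

# statistics.variance also divides by n - 1, so the two should agree
assert math.isclose(s2, statistics.variance(data))
```

For this data set the squared deviations are 4, 1, 0, 1 and 4, summing to 10, and dividing by n - 1 = 4 gives a sample variance of 2.5.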
We used n in the calculation of the mean, whereas in the calculation of the variance we are using n - 1. The term n or n - 1 refers to the degrees of freedom; n of course stands for the size of the data set. We are looking at the number of independent entities in the data set. When you collect a data set, the entities in it have been chosen in such a way that they are independent. So when you are finding the mean value, you are dealing with n independent entities. However, when you are calculating the standard deviation or the variance, you are basing those calculations on the deviations from the mean value, and not all the deviations are independent. The mean has been defined in such a way that the sum of the deviations from the mean is equal to 0, and this acts like a constraint. So you have only n - 1 deviations from the mean that are independent. This is an interesting situation; how do we deal with it? The number of independent entities is only n - 1, and so we use n - 1 in the calculation of the sample variance. There is also another reason why we use n - 1; I will come to it shortly.
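The constraint mentioned above, that the deviations from the mean sum to 0, is easy to verify numerically. Knowing any n - 1 of the deviations fixes the last one, which is exactly why only n - 1 of them are independent:

```python
data = [3.0, 8.0, 5.0, 10.0, 4.0]
xbar = sum(data) / len(data)                 # mean is 6.0
deviations = [x - xbar for x in data]        # [-3.0, 2.0, -1.0, 4.0, -2.0]

# The deviations from the mean always sum to (numerically) zero:
# this single constraint costs one degree of freedom.
total = sum(deviations)

# Consequently, the last deviation is fully determined by the first n - 1
last_from_rest = -sum(deviations[:-1])       # equals deviations[-1]
```

Given the first four deviations -3, 2, -1 and 4, the fifth has no freedom left: it must be -2 for the sum to vanish.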