Hello and welcome, learners, to this online video session on data quality and pre-processing for the course Data Analytics. I am Mr. Vipul Kondekar from Walchand Institute of Technology, Sholapur. These are the learning outcomes for this session: you will be able to identify the exact problems affecting the quality of the data and the solutions through which the data quality can be improved. The main problems affecting data quality are listed here: missing values present in the data, inconsistent values, redundant values present in the data, noisy data, and outlier values. These are some of the reasons which affect the quality of the data, and you need to deal with them before giving the data as input to any machine learning algorithm. If you give the raw data as it is, the learning itself may become more complex, or it may take more time. Hence you should handle these data quality problems before giving the data as input to any machine learning algorithm. Then the question comes: why does the data quality get affected? The data quality may be affected by internal factors, such as problems in the measurement process or the data collection process itself, or by external factors, such as the way the data was collected, sensor properties, or human errors. All of these result in degradation of the data quality, and so you need to take care of the data quality and improve it.
Now let us explore these problems one by one. The first problem is missing values. If you look at the left-hand side of this table, you will find that in the tabular representation of the data, columns represent attribute values and rows represent instances, and for several instances certain values are missing. Let us say you have a person in your contact list who likes burgers, but the age information is not available. If you could somehow fill in those values, you would get the right-hand side of the table, which is the data without missing values. First, let us think about the reasons why you get missing values in the data. One basic reason may be a gap between the time at which the data collection started and the time at which the recording started, so you miss certain values. A few values may not have been known at the time of collection, there may have been some distraction at the time of collection, some values may not have been required and hence were never recorded, and some values may not have existed at all. These are a few reasons why you may get missing values. But if it is a problem, then what can be the solution? How do we deal with missing values? The first approach is to ignore those values: simply do not consider them in the analytics. The second approach is to remove the objects: suppose you have one instance with five attributes, and one of the five attribute values is missing; then you remove that object, that is, that instance itself, from the analysis.
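The approaches above can be sketched in a few lines of Python. This is a minimal illustration on a hypothetical contact list (the names and ages are made up for the example): removing instances with a missing attribute, and estimating the missing value by mean imputation.

```python
# A toy contact list; None marks a missing age (hypothetical data,
# loosely following the lecture's contact-list example).
contacts = [
    {"name": "A", "age": 43},
    {"name": "B", "age": None},   # age was never recorded
    {"name": "C", "age": 25},
    {"name": "D", "age": 37},
]

# Approach 2: remove the objects (instances) with a missing attribute.
complete = [c for c in contacts if c["age"] is not None]

# Approach 3: estimate the missing value, here with the mean of the
# known ages (one simple choice of estimator among many).
known = [c["age"] for c in contacts if c["age"] is not None]
mean_age = sum(known) / len(known)
imputed = [
    dict(c, age=c["age"] if c["age"] is not None else round(mean_age))
    for c in contacts
]
```

Mean imputation is only one possible estimator; depending on the attribute, a median, a mode, or a model-based prediction may be more appropriate.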
The third, more engineering approach to the missing value problem is estimation: can you come up with some algorithm that estimates what the missing value should be? This is how you can deal with the missing value problem. The second problem is redundant data, also called the duplicate data problem. Here the data contains some duplication. If you look at row number two and row number three, they represent the same information: Italian food, age 43, distance very close, company good. The same duplication is observed for another pair of rows. To obtain data without redundant objects, you simply discard the rows where duplication is present. This is how you deal with the redundant data problem. The next problem is inconsistent data values. Suppose your data is a contact list of 14 different instances, where for each contact the stored information includes the maximum temperature in that person's region, represented in degrees centigrade. For one of the contacts, perhaps because of a typo, the recorded value is 300 instead of 30: a temperature of 300 degrees centigrade. That is inconsistent data. Similarly, a person with a weight of 1100 kilograms, or a height of 10 centimetres, is inconsistent data. You should check for these inconsistencies before giving the data as input to any machine learning algorithm. The next problem that may occur in the data is noisy data; noise is basically an unwanted signal.
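Both duplicate removal and inconsistency checks can be sketched briefly. The rows and the plausible temperature range below are illustrative assumptions, not values from the lecture's actual table: duplicates are dropped by keeping only the first occurrence of each row, and inconsistent values are flagged with a simple range check.

```python
# Toy rows in the spirit of the lecture's table (hypothetical values):
# (food, age, distance, company)
rows = [
    ("Italian", 43, "very close", "good"),
    ("Italian", 43, "very close", "good"),   # exact duplicate of row 1
    ("Burger", 25, "far", "bad"),
]

# Remove duplicate objects, keeping the first occurrence of each row.
seen = set()
deduped = []
for r in rows:
    if r not in seen:
        seen.add(r)
        deduped.append(r)

# Flag inconsistent values with a range check; the plausible range for
# a maximum regional temperature in degrees centigrade is an assumption.
def is_consistent_temp(temp_c):
    return -60 <= temp_c <= 60

temps = [30, 300, 28]            # 300 is likely a typo for 30
flags = [is_consistent_temp(t) for t in temps]
```

The same range-check idea extends to weight and height: each attribute gets its own plausible interval, and values outside it are flagged for correction or removal.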
Suppose this is a scatter plot of two different attributes, where some instances represent the sick condition and some represent the healthy condition, and the data is without noise. It is then very easy to classify these instances into the two classes, sick or healthy. But if some additional data points appear in between, these are noisy data points: they represent neither the healthy condition nor the sick condition, and they make the classification task much more difficult. This problem is called noisy data, and you can use a filtering approach to solve it. Filters here are algorithms capable of removing these noisy data points. This is how you deal with the noisy data problem. Then comes the last data quality problem in our discussion: outlier values. Many times, again in a scatter plot, you may find that most of the instances are crowded in one region, forming one class, while the remaining instances form a second cloud, forming the second class, in a classification task where the instances are divided into two classes: healthy person or sick person. But you may also observe one instance that lies far away from both of these clouds. In a bivariate analysis, this instance takes a very high value on one attribute and a very high value on the other attribute, and it represents neither a healthy nor a sick person. Such a value is treated as an outlier, and outlier values significantly affect the analysis.
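The lecture mentions filters as algorithms for removing noise but does not specify one. As one concrete sketch (my choice, not the lecture's), a sliding-window median filter on a one-dimensional signal suppresses isolated noise spikes while preserving the underlying values:

```python
from statistics import median

def median_filter(signal, k=3):
    """Replace each sample by the median of a window of size k (k odd)."""
    half = k // 2
    out = []
    for i in range(len(signal)):
        window = signal[max(0, i - half): i + half + 1]
        out.append(median(window))
    return out

noisy = [1, 1, 9, 1, 1, 1]   # the 9 is an isolated noise spike
smoothed = median_filter(noisy)
```

The median is preferred over the mean here because a single extreme sample cannot drag the window's median away from the surrounding values.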
For example, if you are calculating the mean and some outlier values are present in your data, the mean itself loses its significance. So these outlier values need to be identified and removed in order to improve the quality of the data. Now the question is: how can we detect the outlier values present in the data? Here is one simple but effective method, which takes the help of statistical descriptors. First, I calculate the first quartile value and the third quartile value for the attribute, and then the difference between the third quartile and the first quartile, which we call the interquartile range. Once the interquartile range is calculated, I declare two limits: the lower limit is the first quartile minus 1.5 times the interquartile range, and the upper limit is the third quartile plus 1.5 times the interquartile range. If a data value lies between the lower limit and the upper limit, I treat it as a valid data point; otherwise I consider it an outlier. An outlier, in this sense, is a value too far away from the central values, and you can simply remove that data point from the analysis. These, then, are the five important problems we discussed for the quality of the data, and they come under the pre-processing part: you should deal with the missing values, the redundant data values, and the inconsistent values present in the data; if some noise is present, you should attempt to remove it; and finally, if some outlier values are present, you should try to remove those as well. These are the references used for this video. Thank you.
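The interquartile-range method described above can be sketched directly, assuming a small made-up data set; note that `statistics.quantiles` uses the "exclusive" quartile convention by default, so the exact fence values can differ slightly from other quartile definitions.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Split values into (valid, outliers) using the Q1/Q3 fences."""
    q1, _, q3 = quantiles(values, n=4)   # first, second, third quartile
    iqr = q3 - q1                        # interquartile range
    lower = q1 - k * iqr                 # lower limit: Q1 - 1.5 * IQR
    upper = q3 + k * iqr                 # upper limit: Q3 + 1.5 * IQR
    valid = [v for v in values if lower <= v <= upper]
    outliers = [v for v in values if not (lower <= v <= upper)]
    return valid, outliers

data = [10, 12, 11, 13, 12, 14, 11, 100]   # 100 lies far from the rest
valid, outliers = iqr_outliers(data)
```

The factor 1.5 matches the lecture's rule; widening it (for example to 3) flags only the most extreme values, which is a common variant.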