In the previous video, we understood at an abstract level what data pre-processing is. Now we will understand in detail why data pre-processing is required. Data pre-processing is required mainly to check the quality of the data, which we assess along several dimensions.

The first is accuracy: whether the data entered is correct or not. The second is completeness: whether all of the data has actually been recorded and is available, or only part of it. Then consistency: are we keeping the data in the same form throughout? For example, if we have n attributes, is each attribute kept in the same position in every record? Then timeliness: if the data is collected over a period of time, is it getting updated and stored correctly? Two more are believability and interpretability. Believability concerns whether the data can be trusted, that is, whether it is testable. Interpretability concerns whether we can understand the data, that is, whether we can draw meaning from the data we have collected.

To achieve accuracy, completeness, consistency, timeliness, believability and interpretability, we perform various tasks in data pre-processing. These are majorly classified into four sets: data cleaning, data integration, data reduction and data transformation. These are the four major tasks in data pre-processing.

Now we will see: what is data cleaning? One of the simplest examples is finding an outlier. If there is an outlier, we will delete it or do something about it. That is data cleaning. Data integration is when your data comes from different sources and is stored in different databases.
Now, can we combine it to form one dataset, and from that dataset do the processing? That is data integration. The next is data reduction. In data reduction there will be several attributes and several instances, and we convert the data into a format such that the information we get from it stays the same but the volume of data is reduced. There are two ways we do this: one is dimensionality reduction, the other is numerosity reduction. We will not discuss these in detail in this video lecture, but if you are interested in data reduction, you can go and look up dimensionality reduction and numerosity reduction.

Now, data transformation. The given data, in its current format, may not be usable by the method for analysis. In some cases we are supposed to convert the data into some other format, perhaps using scaling methods, where we scale the given data to a particular range, say from 0 to 1 or from -1 to 1. There are various methods used in data transformation; we will discuss a few.

We will start with data cleaning. Data cleaning routines work to clean the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies. These are the four operations data cleaning predominantly performs. Now let us take an example to understand one of the cases. Say we have data for five instances, and we already have the gender of each. Now we have collected data on whether each person is pregnant or not. From the collected data, it is observed that in two instances the gender is male but the record says pregnant, which is an inconsistency in the data.
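As a rough illustration, such a five-instance dataset and a simple consistency check might look like this in plain Python. The record layout and values here are assumptions for illustration; the lecture only describes the data verbally.

```python
# Hypothetical five-instance dataset: gender was already known,
# and the "pregnant" attribute has just been collected.
records = [
    {"id": 1, "gender": "male",   "pregnant": "no"},
    {"id": 2, "gender": "female", "pregnant": "yes"},
    {"id": 3, "gender": "male",   "pregnant": "yes"},   # inconsistent
    {"id": 4, "gender": "female", "pregnant": "no"},
    {"id": 5, "gender": "male",   "pregnant": "yes"},   # inconsistent
]

def find_inconsistent(rows):
    """Return the ids of rows violating the rule: a male cannot be pregnant."""
    return [r["id"] for r in rows
            if r["gender"] == "male" and r["pregnant"] == "yes"]

print(find_inconsistent(records))  # → [3, 5]
```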
If we train a model or a machine learning method with this, it may give us ambiguous results. So how do we handle this inconsistent data? One way is simply to remove the rows that contain the inconsistency. Here, two rows are inconsistent; if we remove those two, only rows one, two and four remain to be used as data for the processing. That is one solution, and it is one way of doing data cleaning: just removing the bad rows.

Can we do this all the time? If the dataset is huge, we can afford to delete such rows directly. But if the dataset is very small and most of the attributes have one or another missing or inconsistent value, then what? For that we need to understand a few other concepts. One is data editing: since we know that a male cannot be pregnant, we change the gender value from male to female. This is called data editing, where we edit the data. Another is data reduction. Suppose we care only about the females: from the collected data, only the records where the gender entered is female will be considered, and the analysis is done on those. We reduce the given data so that we use only the information required for the given problem, and we are not worried about the other information. Here, rows two and four are the ones where female was entered, so we use only those two records for the analysis. This is a simple example of data reduction. Formally, data reduction is the transformation of numerical or alphabetical digital information, derived empirically or experimentally, into a corrected, ordered and simplified form.
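The three strategies just described, removing the inconsistent rows, editing the offending value, and reducing the data to just the relevant subset, can be sketched in plain Python. The field names and values are again assumptions for illustration.

```python
# Hypothetical five-instance dataset from the lecture's example.
records = [
    {"id": 1, "gender": "male",   "pregnant": "no"},
    {"id": 2, "gender": "female", "pregnant": "yes"},
    {"id": 3, "gender": "male",   "pregnant": "yes"},   # inconsistent
    {"id": 4, "gender": "female", "pregnant": "no"},
    {"id": 5, "gender": "male",   "pregnant": "yes"},   # inconsistent
]

def is_inconsistent(r):
    return r["gender"] == "male" and r["pregnant"] == "yes"

# 1. Cleaning by deletion: drop the inconsistent rows entirely.
cleaned = [r for r in records if not is_inconsistent(r)]

# 2. Data editing: assume the gender field was mistyped and correct it.
edited = [dict(r, gender="female") if is_inconsistent(r) else dict(r)
          for r in records]

# 3. Data reduction: keep only the subset relevant to the question,
#    here the records where gender was entered as female.
reduced = [r for r in records if r["gender"] == "female"]

print([r["id"] for r in cleaned])  # → [1, 2, 4]
print([r["id"] for r in reduced])  # → [2, 4]
```

Which strategy is appropriate depends on the dataset: deletion is safe when data is plentiful, while editing and reduction preserve more records when data is scarce.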
Note that data reduction obtains a reduced representation of the dataset that is much smaller in volume yet produces the same, or almost the same, analytical results. As I already told you, there are two approaches, dimensionality reduction and numerosity reduction, and you can go to the literature and find more about them. In dimensionality reduction, if the data has some n dimensions, I can convert it into fewer dimensions, maybe one or two, and then do the analysis and extract the information using the method. That is data reduction.

Now let us go to the next part: data transformation and data integration. What do we mean by data transformation? In data transformation, we transform the given data into some other format without losing or modifying the information it carries. One of the techniques used in data transformation is normalization. Normalization is a method of scaling the data so that it can be represented in a smaller range, say from -1 to 1. Is normalization, or scaling, the only technique in data transformation? No, there are various other possibilities, which we will discuss in detail later.

Now we will see: what is data integration? In data integration, we have data coming from different databases, and we must integrate it in such a way that the collective dataset conveys the same information that is present in the separate databases. For example, you want to take the student ID from one database and the student name from another database. Now, the correct student's name should be matched to each student ID.
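Combining a student-ID table with a student-name table from another database can be sketched as a join on the shared key. The table contents and column names below are hypothetical, chosen just to show the matching step.

```python
# Two hypothetical tables from two different databases.
db1 = [{"student_id": 101, "marks": 88},
       {"student_id": 102, "marks": 74}]
db2 = [{"student_id": 102, "name": "Asha"},
       {"student_id": 101, "name": "Ravi"}]

# Integrate: match records by the shared key so that the
# same student's name is attached to the same student ID,
# regardless of the row order in either source.
names = {r["student_id"]: r["name"] for r in db2}
integrated = [dict(r, name=names[r["student_id"]]) for r in db1]

print(integrated[0])  # → {'student_id': 101, 'marks': 88, 'name': 'Ravi'}
```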
How can we actually combine these two in such a way that the result is consistent? This problem is called the entity identification problem, and it comes up in data integration. Is this the only problem we face, or the only problem we solve, in data integration? No, it is one of the problems we solve in data integration, but it is a dominant one.

Now you understand the various tasks in data preprocessing. To summarize: in this video we understood why data preprocessing is required and what factors show the requirement for it. Then we saw the major tasks in data preprocessing. We saw what data cleaning is, and within that, what data editing is. Then we understood data integration, where we saw the entity identification problem. Then we saw data reduction through a simple example, and we noted the terms dimensionality reduction and numerosity reduction, from which you can further understand data reduction. Finally, in data transformation, we saw normalization, where we scale the given data within a given range.
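As a closing illustration, the min-max normalization mentioned above, scaling values linearly into a fixed range such as 0 to 1 or -1 to 1, can be sketched as follows. The marks data is made up for the example.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale a list of numbers linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # all values equal: avoid division by zero
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]

marks = [20, 40, 60, 80, 100]          # hypothetical raw data
print(min_max_normalize(marks))        # → [0.0, 0.25, 0.5, 0.75, 1.0]
print(min_max_normalize(marks, -1, 1)) # → [-1.0, -0.5, 0.0, 0.5, 1.0]
```

The transformation is reversible and keeps the relative spacing of the values, so no information is lost, which is exactly the property data transformation asks for.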