In the previous video we saw the major tasks in data preprocessing, that is, cleaning, integration, reduction and transformation. In this lecture we try to understand some of these major tasks in depth, starting with missing data.

So what is missing data? Missing data is defined as a value that is not stored or not present for some variable in the given dataset. Now, how do we find missing data? If the dataset is small, we can find the missing values manually; that is one way. If it is a little larger, we can write a few lines of code: in pandas you can directly check with functions such as isnull(), notnull() and isna() (NaN stands for "not a number"). Is it sufficient to work manually or with this sort of simple code? Not always. These approaches are only useful if you have a limited set of missing values or the dataset is small. So if the dataset is very large, how do we address the issue? There are various ways to handle missing data, and to understand them we first need to know why missing data arises.

Why is there missing data in a dataset? There are various reasons. The data may get corrupted due to improper maintenance, which comes under system error. It can also happen that we fail to record a certain value due to human error; as we discussed earlier, this covers slips and mistakes. And there can be a violation, where the user intentionally does not provide the information. These are the three major causes of missing data in a given dataset.

How do we handle it? To answer that, we first have to know why we must handle missing data at all. Most machine learning algorithms fail if the dataset contains missing values. A few algorithms, such as K-nearest neighbours and Naive Bayes, still work even when data is missing, but most fail if the dataset lacks proper information or contains missing values. You may also end up building a biased machine learning model that does not give you accurate or optimal results. If the model does not get trained properly, there are chances that the overall efficiency of the model goes down; for example, the precision, accuracy or recall of the algorithm may suffer because of a large amount of missing data. These are the reasons why we have to handle missing data. By handling it, we get a model that trains well and is more optimal, more robust and, in some cases, more efficient.

With this, let us see how best we can handle missing data. We can do one of two things: delete the data or fill in the data. We cannot delete data if the dataset is too small, and there are various other cases in which deleting is not the best option. When we cannot delete, we can fill in the missing values, that is, impute them. Imputing means assigning a value to something by inference. So there are two cases: if the dataset is extremely large and the missing values are very few, we can delete them.
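To make the pandas checks above concrete, here is a minimal sketch; the DataFrame df and its column names are hypothetical:

```python
import pandas as pd
import numpy as np

# Hypothetical dataset with a few missing entries
df = pd.DataFrame({
    "age":    [21, np.nan, 25, 30],
    "gender": ["F", "M", None, "F"],
})

print(df.isnull())         # boolean mask: True where a value is missing
print(df.isnull().sum())   # count of missing values per column
print(df.notnull().sum())  # count of non-missing values per column
```

Scanning these counts works well on small datasets, but as noted above, it does not by itself tell you how the missing values should be treated.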
If the dataset is small, we cannot do that. Imputing works the other way around: if the dataset is too small we can impute, and if it is very large we can also impute, provided the imputation techniques make sense for the given data. So what are the various ways in which we can impute? Deleting we already know from the data cleaning phase, where we delete the data that is not required. Now, how do we impute? There are several ways, but before we look at them we have to know what type of missingness we are dealing with. There are three categories: MCAR, MAR and MNAR. For the given data we have to understand which variable or attribute is MCAR, which is MAR and which is MNAR.

MCAR, missing completely at random, means there is no relationship between the missing data and any other values, observed or unobserved, in the given dataset. The missing values are completely independent of the other data, and there is no pattern. For example, consider the results obtained by a student and the place they lived before: in most analyses the results are independent of the region where the student was born or lived, so this can be one example of MCAR.

Next, MAR is missing at random. It means the reason for the missing values can be explained by a variable on which you have complete information, as there is some relationship between the missing data and the other values in the data. For example, suppose you have demographic data with several questions, among them gender and age, and you observe that respondents who filled in female for gender often did not fill in the age. It is not always true, but there is a dependency: the missingness in age is related to gender. This kind of relationship between the missing data and the other values of the data is MAR.

The third one, MNAR, stands for missing not at random. Here the missingness depends on the unobserved values themselves, and no structure or pattern in the other observed data can explain it. For example, suppose a form has an income field asking from how many sources people earn their income and what those sources are. If much of that information is missing, there is a chance that the people who skipped it have many income sources. We can neither neglect this type of data nor impute it from anything else.

For such situations there is a classification based on complexity, taken from the literature. If the complexity of the given problem is low, then for MCAR and MAR we can ignore the missing values, but for MNAR we cannot ignore them and have to do something. If the complexity is high, you cannot even ignore the missing-at-random part; in that case we have to do some imputation, of a single value or multiple values, for the attribute or attributes involved.
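Since we have to judge which type of missingness a variable shows, here is a minimal heuristic sketch along the lines of the gender/age example; the DataFrame and its columns are hypothetical, and this only hints at MAR versus MCAR, it is not a formal test:

```python
import pandas as pd
import numpy as np

# Hypothetical demographic data where age is sometimes missing
df = pd.DataFrame({
    "gender": ["F", "M", "F", "F", "M", "F"],
    "age":    [np.nan, 34, np.nan, 29, 41, np.nan],
})

# Indicator of missingness in 'age'
df["age_missing"] = df["age"].isna()

# If the missing rate differs strongly across gender groups,
# the missingness in 'age' may be MAR (related to gender)
# rather than MCAR (completely at random).
print(df.groupby("gender")["age_missing"].mean())
```

MNAR cannot be detected this way, because the missingness depends on values we never observed.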
If it is MCAR, in any case we are not worried: we can simply delete the data and proceed. If it is MNAR, then either you have to recollect the data or you have to see what the best possible way to address it is. Even the data preprocessing techniques may not be sufficient in some MNAR cases, where the missing data has a strong dependency on the problem you are solving; if that particular variable or attribute itself is missing, you have to find a way to obtain better data rather than relying on preprocessing techniques. Now we will see what imputations we can do for MAR and MNAR; even for MNAR you can do some preprocessing if the complexity or the dependency is low.

Let us see the various ways in which we can fill in missing data. If the dataset is extremely large, what can you do? If one of the instances does not contain the required information, you can delete the entire instance, that is, delete the entire row; this is one possibility. If one of the attributes is missing for all the instances, we can delete the entire column; that is the second case. If some data is missing in rows and some in columns, deleting both is fine if the amount is small. But if it is large, you have to make sure you do not end up deleting the whole dataset. You can delete to the extent that the remaining data is still meaningful; if the data becomes biased or starts to convey some other meaning, we have to stop. So, if only one or two instances are affected we can delete the entire row or the entire column, and if very few instances are affected we can delete both rows and columns, but we must not end up deleting the entire dataset. That is deleting, and it is quite simple.

Now let us see what imputing is. Imputing is filling some value into the missing place using some method. There are various methods, and not every method is right for a given dataset or problem. Take replacing with an arbitrary value: suppose one of the columns has mostly missing data but is important. For example, for the corporate work experience of an undergraduate student we can set the value to 0, since it is rare for students at that level to have such experience, so the arbitrary value makes sense here. Similarly, the number of married students in a high school can be arbitrarily set to 0, because that is the general case and it is very rare to get other numbers. In such cases we can fill with an arbitrary value. Apart from that, if the data is missing in a numerical column, we can replace the missing values with the mean. If it is a categorical feature, meaning we have categories such as dog, cat and elephant in a classification, then we can replace the missing data with the mode. And if your data contains outliers and you have not resolved the outlier issue, you can use the median and simply fill the data with it.
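Putting the deletion and simple imputation options together, here is a minimal pandas sketch; the DataFrame and its columns are hypothetical:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "work_exp": [np.nan, 2.0, np.nan, 0.0],      # important column, mostly missing
    "salary":   [30000, np.nan, 45000, 52000],   # numerical feature
    "species":  ["dog", "cat", np.nan, "dog"],   # categorical feature
})

# Deleting: drop rows or columns that contain missing values
df_rows_dropped = df.dropna(axis=0)  # delete entire rows
df_cols_dropped = df.dropna(axis=1)  # delete entire columns

# Imputing: arbitrary value, mean, mode
df["work_exp"] = df["work_exp"].fillna(0)                       # arbitrary value
df["salary"]   = df["salary"].fillna(df["salary"].mean())       # mean for numerical
df["species"]  = df["species"].fillna(df["species"].mode()[0])  # mode for categorical
# For a numerical column with unresolved outliers, use
# df["salary"].median() instead of the mean.
```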
These four approaches cannot be used in certain cases, for example with time series data. What do we mean by time series data? Say the temperature change over one hour, or over a 24-hour duration from midnight to midnight the next day. If you are recording the temperature, you cannot fill gaps with the mean, median, mode or some arbitrary value; instead we use the previous value or the next value. If you take the previous value and fill it in, that is called forward fill. If you take the next value and fill it in, that is called backward fill. So in time series data we generally use forward fill or backward fill, as in the sketch after this part.

So we have seen how to replace missing data in various ways: with an arbitrary value, with the mean, with the mode, with the median, and with forward fill or backward fill. Next, let us see what else we can do when the data is categorical. As I said earlier, one way is to use the most frequent value of the categorical attribute; this we have already discussed: if you have three classes in a category, you can fill with the most frequent one. Apart from this, if the data is huge and that is not feasible, we can create a separate category for the missing values. There is something called data encoding, which we will learn in the coming lectures, where I will explain how categorical data is converted into numerical format. In that context, if you use a separate category for missing data, one more column gets created in one of the encoding techniques, namely one-hot encoding. Just remember this slide; I will connect it back when we discuss one-hot encoding.

Now, imputation can be done for one attribute or for multiple attributes, for one instance or for multiple instances. Based on that we have a division: univariate and multivariate. Univariate involves just one variable, multivariate more than one. When applying the various methods for missing values, you have to see whether you are addressing a univariate or a multivariate case; based on that, there are several approaches, and we have to choose the best one for the given data or problem.

Each approach has its own merits and demerits; let us discuss them. If you are deleting, what are the positives? The complete removal of data with missing values results in a robust and highly accurate model, which is quite obvious: we no longer have any ambiguity or inconsistency because all such data has been deleted. Also, deleting a particular row or column with no specific information is acceptable, since it does not carry high weight. The demerit is that there is some loss of information; some loss of data will always occur, and we may not know at the time whether it really matters. The other demerit is that deletion works poorly when the proportion of missing values is very high. As I said earlier, if the missing values are very many, or the dataset is small and you delete a large part of it, the result will not make sense.
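For the time series case, here is a minimal sketch of forward and backward fill in pandas; the hourly temperature series is hypothetical:

```python
import pandas as pd
import numpy as np

# Hypothetical hourly temperature readings with gaps
temps = pd.Series(
    [21.5, np.nan, 22.1, np.nan, np.nan, 19.8],
    index=pd.date_range("2024-01-01 00:00", periods=6, freq="h"),
)

print(temps.ffill())  # forward fill: carry the previous reading forward
print(temps.bfill())  # backward fill: pull the next reading backward
```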
Now the next approach is replacing with the mean, median or mode. One merit is that this is a better approach when the data size is small, and it prevents the data loss that results from removing rows and columns. The demerit is that imputing an approximation adds variance and bias, and it works poorly compared to multiple-imputation methods. Here there are two terms, variance and bias. You will understand these in a separate set of additional video lectures explaining what variance and bias are. Just to give you a brief view, referring to the diagram on the slide: if the predicted data and the actual data coincide, that is low variance and low bias. If the predictions are spread around the actual data, that is low bias but high variance. If the predictions are close to each other but far from the actual data, that is high bias and low variance. And if they are both spread out and far from the actual data, that is high bias and high variance. Our algorithm should be optimal in terms of bias and variance: the predictions should stay close to the actual data. If either is too high, the method will not work properly.

Next, what are the merits and demerits of assigning a unique category? Assigning a unique category actually brings in less variance, but it adds another feature to the model while encoding, which may result in poor performance; in the multimodal learning analytics video lecture we will see how adding one more category decreases the variance yet adds a feature. That is the demerit of adding a unique category. I gave you the example of one-hot encoding, where it adds one more column; when we discuss that part, I will go through this again in detail. The merits: fewer possibilities with one extra category, resulting in low variance after one-hot encoding since the attribute is categorical, and it negates the loss of data by adding a unique category. For each case we will work through an example when we cover the classification, multimodal analytics and affective computing videos, where I will take the same techniques and explain in detail what these merits and demerits mean.

The next approach is predicting the missing value. If you predict a missing value, one merit is that it can yield an unbiased estimate. Imputing the missing variable is an improvement as long as the bias it introduces is smaller than the omitted-variable bias; if these two conditions hold, we can use this approach. Otherwise, the imputed value only ever acts as a proxy for the true variable. A second issue is that bias also arises when an incomplete conditioning set is used for a categorical variable. We can also use a similarity- or support-based method, such as K-nearest neighbours, to fill the missing values. The drawback is that it is a very time-consuming process, and the choice of distance function, which can be Euclidean or Manhattan, may not yield a robust result. But there is always a merit with such methods: they do not require creating a separate predictive model for each attribute with missing data, so you are not building and repeatedly running your own predictive model. On the other hand, the correlation structure of the data is neglected when you fill missing values this way.
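As one concrete instance of such a similarity-based approach, here is a minimal sketch using scikit-learn's KNNImputer; the small numeric array is hypothetical, and treating KNN imputation as the lecture's support-based method is my reading of the slide:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical numeric data with missing entries
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing entry is filled using the mean of that feature
# over the 2 nearest rows (distance computed on observed entries)
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
```

As noted above, distances (Euclidean here, by default) must be computed against many rows, which is what makes this family of methods slow on large datasets.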
What do I mean by prediction and support? Since I did not discuss in detail how to predict the values or how to use such support-based systems, I am only trying to give you a notion of how they work. This is from a research article that describes the various ways of handling missing data. We saw the conventional methods and some univariate methods such as median and mode, but there are several other methods also used to fill missing values: statistical methods, machine learning methods, likelihood-based methods and similarity methods. We can use any type of method to fill the missing data, but we have to see what is best for our data or problem, and applying it should not take too long; the data preprocessing itself should not consume so much time that the entire pipeline becomes slow. So we have to choose the best possible way to address the missing data. If the data you collected or the problem you are solving requires algorithms beyond the basic conventional or univariate methods, you can use machine learning methods, likelihood-based methods or similarity methods, but each takes its own time, and you need to understand the pros and cons of every method before using it. I will not be discussing all of them, but I have given you a brief idea of the various methods available to fill missing values.

To summarize, in this lecture we saw what missing data is, why we handle it, and the types of missingness: MNAR, MAR and MCAR. Based on the complexity, we can ignore the missing values, use some imputation, or we might have to collect the data again. We also covered the various ways to handle missing data. I have shown you the full picture of the methods available in the literature; you can choose the best possible method for a given dataset or problem and handle the missing data accordingly.