Okay, so let's start. As you may already know, this is a three-session series on data pre-processing. Last Thursday we talked about the background, including the definition and context of data pre-processing, and also some methods and tips for collecting data. We also talked about data integration, like the ways of joining tables as well as data linkage and data enrichment. And today we are going to cover some other common data preparation tasks, including data cleaning, data reduction, and data transformation. Next Thursday, 1 p.m., we are going to deliver a coding demonstration on YouTube. We're going to provide some sample Python code and perform these tasks. If you're interested in this, you're welcome to join. And also, if you want to know more about this topic or any specific part that I didn't mention, we're going to send a questionnaire after the talk, so please leave your comments to help us improve.

And just to recap, data pre-processing, also known as data preparation, is the process of manipulating or pre-processing raw data from one or more sources into a structured and clean dataset for analysis. It is an important part of data analytics. When data has been fully prepared, it is then ready for further analysis like classification, association, prediction, and clustering. So generally speaking, it can be regarded as everything you do before the actual modeling. It sounds straightforward, but it is full of challenges. It is very time consuming, accounting for almost 80% of the data analysis work. And the motivation is to build a good foundation for further actions so that we can get meaningful and accurate results. There's no clear recipe or standard for data pre-processing. Usually it includes various tasks and considerations, for example data integration, data cleaning, data reduction, data transformation, and so on and so forth.

So today, let's start with data cleaning. Data quality can be reviewed from multiple dimensions: whether the data is complete, whether the data is accurate, whether the data is consistent, and other things. In this sense, data in the real world is often very dirty. For example, it can miss certain attribute values or even certain attributes, or contain only aggregated data. Occupation could be empty, or, as a less obvious example, January 1st can appear as everyone's birthday. Data can also be inaccurate or noisy, which means it contains errors or outliers, for example a salary with a negative value, which is very questionable. It can also be inconsistent, when data contains discrepancies in codes or names, for example when the age is 42 but the birthday indicates that the person was born in 1997.

So here's an example of a very dirty dataset. Firstly, we have missing values: some are shown as empty, some are shown as null, and some are shown here as unknown. Secondly, we can see that some data is in the wrong format if we expect to see the month first and then the date and year. Thirdly, we have some inconsistent data here: based on the street name and postcode, this entry should be Manchester, but it says Edinburgh, so it is inconsistent with the other values. Fourthly, we can have some incorrect postcodes. For example, here we have the postcode M88 5KH, but we all know that M88 doesn't exist at all. And last but not least, we can have duplicate records. These might be kind of hard to identify, but they are sometimes just there. Like here, we have two entries that have very similar values.
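Many of these problems can be surfaced quickly with a few lines of pandas. Here is a minimal sketch, using a made-up table with made-up column names and values (not the dataset from the slides), of how the missing values, wrong date formats, impossible postcodes, and duplicates described above could be detected.

```python
import pandas as pd

# A tiny, made-up table with the kinds of problems described above.
df = pd.DataFrame({
    "name":     ["Ann Lee", "Bob May", "Bob May", "Cat Roe"],
    "birthday": ["01/01/1990", "1990-07-03", None, "13/40/1990"],
    "city":     ["Manchester", "Edinburgh", "Edinburgh", "unknown"],
    "postcode": ["M1 5KH", "M88 5KH", "M88 5KH", None],
})

# Missing values: treat sentinel strings such as "unknown" or "null" as missing too.
df = df.replace(["null", "unknown", ""], pd.NA)
print(df.isna().sum())

# Wrong date formats: parse as month/day/year; anything that fails becomes NaT.
df["birthday"] = pd.to_datetime(df["birthday"], format="%m/%d/%Y", errors="coerce")

# Suspicious postcodes: flag outward codes that are not in a reference list.
valid_outward = {"M1", "M8", "EH1"}            # toy reference list, not a real one
print(df[~df["postcode"].str.split().str[0].isin(valid_outward)])

# Possible duplicates: identical name and postcode is a cheap first check.
print(df[df.duplicated(subset=["name", "postcode"], keep=False)])
```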
So if we put this data directly into use, it's not likely that we can get useful, accurate results. We need to clean it first. So why data cleaning? Well, it's been acknowledged that data cleaning is one of the three biggest problems in data warehousing, and has even been called the number one problem in data warehousing. No matter how fancy the data mining algorithms are, they cannot make up for dirty data. We do data cleaning to avoid what data scientists call GIGO, which stands for garbage in, garbage out.

Let's look at the missing data first and the noisy data later. Incomplete or missing data means that data is not available, for example many rows have no recorded value for several attributes, such as customer income in sales data. Missing data may be due to multiple reasons. Firstly, equipment malfunction; or the data was inconsistent with other recorded data and was therefore deleted; it might even have been removed on purpose to get through some auditing; or the data was not entered due to misunderstanding, or maybe it was not considered very important at the time of entry; or there is no recorded history of changes to the data.

We definitely need to deal with missing values, but there is no easy fix, only several solutions that you might find helpful depending on your context. For example, we can throw out the records with missing values. It's a solution, but not much preferred, as it creates a bias in the sample, and we need to think about whether the values are missing at random or not at random. Or perhaps we can delete the column with missing values. That's also a very questionable solution: if the column's data is unnecessary we can do it, but otherwise it's inapplicable. Another solution is to replace the missing value with a special value, for example negative 99. Well, usually we don't do this, because to the analytics algorithms this special value looks just like any other value. We can also perhaps replace it with some typical value, for example the mean, median, or mode. That may be possible, but it can change the distribution, so we need to use that method with care. We might also impute a value; in this sense, imputed values should be flagged, and using the distribution of values to randomly choose a replacement is often suggested. Another solution is to use data mining techniques to handle the missing values. Yes, that is a solution; for example, a decision tree can be applicable, and maybe we find that if the postcode starts with M, then the city is likely to be Manchester. Last but not least, we can partition the records and build separate models for each partition; that is very plausible when the data is sufficient. These are just some solutions and inspirations. You can also take other actions, for example introduce another dataset which contains the missing information you need. But remember to be careful when integrating another dataset, because it might also create some new missing values.
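As a small illustration of some of these options, here is a minimal sketch in pandas, using made-up column names and values, that drops incomplete rows, fills missing values with the median while flagging imputed entries, and draws replacements at random from the observed distribution.

```python
import numpy as np
import pandas as pd

# A toy table with missing income values; the column names are made up for illustration.
df = pd.DataFrame({"age": [25, 42, 31, 58, 46],
                   "income": [28000, np.nan, 41000, np.nan, 52000]})

# Option 1: drop rows with missing values (simple, but can bias the sample).
dropped = df.dropna()

# Option 2: fill with a typical value (here the median) and flag the imputed rows.
flagged = df.assign(income_was_missing=df["income"].isna(),
                    income=df["income"].fillna(df["income"].median()))

# Option 3: draw replacement values at random from the observed distribution.
rng = np.random.default_rng(0)
observed = df["income"].dropna().to_numpy()
sampled = df.copy()
missing = sampled["income"].isna()
sampled.loc[missing, "income"] = rng.choice(observed, size=missing.sum())
```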
Here's another example of how tricky this task can be. This is time series data for monthly average sales and monthly average number of customers, and we are missing the information for October, November, and December of 2014. Apparently, we cannot just impute the values from the mean or median of the rest of the records, as we know that December might be a peak, which means it is higher than the rest. And we cannot just delete those three months either, because they are very important. And it seems that the best way is to get the information from the same months of the previous year, even though we cannot be 100% sure that 2014 has the same trend as 2013.

In terms of inaccurate or noisy data, it means random error or variance in a measured variable. Similar to the reasons for missing data, incorrect attribute values may be due to faulty data collection instruments, data entry problems, data transmission problems, technology limitations, or inconsistency in naming conventions. So how do we handle noisy data? We can do binning and smoothing: we can sort the data and partition it into bins, either by equal width or equal depth, and then smooth the data by bin means, medians, or boundaries, etc. We can also adopt regression and smooth the data by fitting it to a function. We can do clustering to detect and remove outliers that fall outside of the clusters. Last but not least, we can combine computer and human inspection to detect suspicious values, for example to deal with possible outliers. These are very common.

Here, I'd like to talk about binning methods for data smoothing in detail. Suppose that we have 12 sorted data points for price, ranging from 4 to 34. We can partition them into three equal-frequency bins. If we do smoothing by bin means, then bin 1 has all its elements replaced by 9, bin 2 by 23, and bin 3 by 29. Alternatively, we can do smoothing by bin boundaries, and then bin 1 becomes 4, 4, 4, and 15, bin 2 becomes 21, 21, 25, and 25, and bin 3 becomes 26, 26, 26, and 34, with each value replaced by the closer of the two bin boundaries. In this way, we minimize the effects of small observation errors, and we also reduce the chance of overfitting in the case of small datasets.

Other concepts related to data cleaning include data scrubbing, which means using simple domain knowledge, for example postcodes or spell checks, to detect errors and make corrections. There is also data auditing, which means analyzing data to discover rules and relationships and to detect violators, for example using correlation and clustering to find outliers. And there is data validation, which means value range checks, regular expressions, and uniqueness checks. Unfortunately, manual work is usually essential to cleaning the data, as a lot of common sense hasn't been programmed, for example around postcodes: you might have to check on your own to make sure that all postcodes are real, correct, and accurate. This usually takes a lot of time and a lot of domain knowledge, and that kind of explains why data pre-processing can account for almost 80% of the work.
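Going back to the binning example for a moment, here is a minimal sketch in NumPy. The twelve price values are assumed (they are the standard textbook values consistent with the bin means and boundaries quoted above), so treat the exact numbers as illustrative.

```python
import numpy as np

# Twelve sorted prices, assumed to match the bin means 9, 23, 29 quoted in the talk.
prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = prices.reshape(3, 4)          # three equal-frequency (equal-depth) bins

# Smoothing by bin means: every value in a bin is replaced by the (rounded) bin mean.
by_means = np.repeat(np.rint(bins.mean(axis=1)).astype(int), 4).reshape(3, 4)

# Smoothing by bin boundaries: every value is replaced by the closer boundary.
low, high = bins[:, :1], bins[:, -1:]
by_bounds = np.where(bins - low <= high - bins, low, high)

print(by_means)    # [[ 9  9  9  9] [23 23 23 23] [29 29 29 29]]
print(by_bounds)   # [[ 4  4  4 15] [21 21 25 25] [26 26 26 34]]
```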
And now let's move to data reduction. So why data reduction? Quite simple: in the age of big data, a database or data warehouse may store terabytes of data, if not more, so complex analysis may take a very long time to run on the complete dataset and usually comes with significant computational complexity. Data reduction means obtaining a reduced representation of the dataset that is much smaller in volume yet produces almost the same analytical results.

Firstly, data reduction is almost always possible during data integration. Redundant data is often created when integrating multiple databases. It can be column-oriented, as the same attribute may have different names in different databases, or row-oriented, for duplicated entities, etc. Therefore, careful integration of the data from multiple sources may help reduce or avoid redundancies and inconsistencies and improve the efficiency and quality of the data analytics, and we just need to remember to check the shape of the merged dataset at the end.

There are two data reduction strategies. First is dimensionality reduction: we can remove data with poor quality, for example many missing values, a large number of categories, or zero variance; or we can remove redundant and irrelevant attributes or variables; or maybe we can do principal component analysis (PCA), variable clustering, and/or feature engineering. Second is numerosity reduction, or data size reduction: we can adopt sampling techniques, use regression and log-linear models so as to keep only the model and discard the data, or maybe use histograms and clustering. These methods are less common, but they are still solutions for dealing with large datasets.

Here's an example. We can see that the input X2 carries similar information to the input X1, and they are positively correlated. If X1 has a certain relation to the output, then we can use X1 to approximate the relation between X2 and the output, and vice versa. In this sense, one of them is actually redundant, and we just need to keep the other one. That is correlation analysis. The correlation between two variables X1 and X2 is the standardized covariance, obtained by normalizing the covariance with the standard deviation of each variable. The sample correlation for two attributes X1 and X2 can be computed as the average, over the n samples, of (X1 minus mu1) times (X2 minus mu2), divided by sigma1 times sigma2, where n is the number of samples, mu1 and mu2 are the respective means, and sigma1 and sigma2 are the respective standard deviations of X1 and X2. If rho12 is greater than zero, X1 and X2 are positively correlated, which means X1's values increase as X2's values increase. If rho12 is zero, there is no linear relationship between the two variables. If rho12 is smaller than zero, they are negatively correlated, which means X1's values decrease as X2's values increase. Correlation coefficient values range from negative one to one, and these scatter plots show sets of points whose correlation coefficients change from negative one to one. We can see that when the correlation coefficient is zero, which is in the middle, there is no linear relationship between the two variables, and we cannot infer one variable from the other. Different methods are used for testing correlation between different types of variables: if both variables are continuous, we use correlation analysis; if both are categorical, we use the chi-square test; and we can use ANOVA or linear discriminant analysis if one is categorical and the other is continuous.
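To make the correlation formula concrete, here is a minimal sketch in NumPy with two made-up attributes; it computes rho12 exactly as described above (average cross-product of deviations divided by the product of the standard deviations) and checks the result against NumPy's built-in function.

```python
import numpy as np

# Two made-up attributes to illustrate correlation analysis.
x1 = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
x2 = np.array([1.1, 2.3, 2.9, 4.2, 4.9])

# Sample correlation: average cross-product of deviations from the means,
# divided by the product of the standard deviations.
n = len(x1)
mu1, mu2 = x1.mean(), x2.mean()
sigma1, sigma2 = x1.std(), x2.std()
rho12 = ((x1 - mu1) * (x2 - mu2)).sum() / (n * sigma1 * sigma2)

print(rho12)                       # close to +1: strongly positively correlated
print(np.corrcoef(x1, x2)[0, 1])   # same value from NumPy's built-in function
```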
Another way of variable reduction is principal component analysis. Principal components are constructed as mathematical transformations of the input variables; each is an uncorrelated linear combination of the original input variables. The components can be written as sums of loadings on the variables. For example, principal component 1 is equal to a1 times X1 plus b1 times X2 plus c1 times X3, where a1, b1, and c1 are called loadings. The coefficients of these linear combinations are the eigenvectors of the correlation or covariance matrix. The principal components are sorted in descending order of the eigenvalues, and the eigenvalues represent the variance of the principal components. So PCA is by far the most popular dimensionality reduction technique, especially if there are many, many attributes: if we have millions of attributes, we might use just 100 principal components to capture their variance, as a small number of principal components can explain a lot of the variation in the data cloud. The disadvantage is that it is difficult or impossible to interpret the constructed principal components.

In terms of numerosity reduction, on the one hand there are non-parametric methods, where we do not assume a model; common methods include sampling, clustering, histograms, etc. On the other hand there are parametric methods, where we assume the data fits some model, estimate the model parameters, store only the parameters, and then discard the data; for example, regression and log-linear models. For reasons of time, I can only expand on sampling methods, which I think are the most common. Sampling means obtaining a small set of samples to represent the whole dataset, assuming that the computational complexity is potentially sublinear to the size of the data. Simple random sampling means that there is an equal probability of selecting any particular object. Sampling without replacement means that once an object is selected, it is removed from the population, whereas sampling with replacement means that a selected object is not removed from the population. Last but not least is stratified sampling, where we approximate the percentage of each class in the overall database before drawing the samples; it is usually used in conjunction with skewed data. So that's it for data reduction.
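Before moving on, here is a minimal sketch in pandas of the sampling strategies just mentioned, using a small, made-up, skewed toy dataset.

```python
import pandas as pd

# A small, imbalanced toy dataset; "label" is skewed towards class "a".
df = pd.DataFrame({"value": range(10),
                   "label": ["a"] * 8 + ["b"] * 2})

# Simple random sampling without replacement: each row can be picked at most once.
without_replacement = df.sample(n=5, replace=False, random_state=0)

# Simple random sampling with replacement: the same row may be picked several times.
with_replacement = df.sample(n=5, replace=True, random_state=0)

# Stratified sampling: keep (approximately) the class proportions, here 50% per class.
stratified = df.groupby("label", group_keys=False).sample(frac=0.5, random_state=0)
print(stratified["label"].value_counts())   # 4 of "a", 1 of "b"
```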
The last section in today's talk is data transformation. Data transformation is a function that maps the entire set of values of a given attribute to a new set of replacement values, such that each old value can be identified with one of the new values. It is very helpful, and often necessary, if you have features at different scales, for example the weight and the income of a person. Many methods, like PCA, require normalization in advance to guarantee their performance. Relevant methods include normalization, or standardization, which scales the data to fall within a smaller, specified range, including min-max normalization, z-score normalization, and normalization by decimal scaling.

There are several different data transformation methods, each for different purposes. For example, we can standardize numeric values, so that every numeric value is replaced by a notion of how far it is from the average. Data transformation is useful, although it sometimes has no effect on the result, such as for decision trees or regression. We can also change counts into percentages, translate dates into durations, and capture trends with ratios, differences, etc. And we can replace categorical values with appropriate numeric values, as many techniques work better with numeric values. So the example on the left transforms year into age, given that the current year is 2018, and on the right we can see that the ratings A to G can be transformed to the median scores of the respective bands. We can also transform data to bring information to the surface simply by using mathematical functions such as logs, reciprocals, or square roots, for stretching or squishing the data. Here's another example of using a logarithm to zoom in on the pattern of the data: before, all the points are just clustered in the corner, but after the logarithm the pattern becomes much clearer.

One-hot encoding is a very common data transformation, where binary variables are used to replace a categorical feature. Some algorithms can work with categorical data directly, for example decision trees, but many machine learning algorithms cannot operate on label data directly; they require all input variables and output variables to be numeric. This means that categorical data must be converted to a numerical form. For example, here the four categories are transformed into four new attributes: cat, dog, turtle, and fish. If the categorical variable is an output variable, you may also want to convert the model's predictions back to a categorical form in order to present them or use them in some applications.

Min-max normalization is one of the most common ways to normalize data. For every feature, the minimum value of that feature gets transformed into a zero, the maximum value gets transformed into a one, and every other value gets transformed into a decimal between zero and one. So, for example, if the minimum income is 12,000 and the maximum is 98,000, and we map the two boundaries onto zero and one, then 73,600 is transformed to 0.716. Next is z-score normalization, another common method. Simply put, a z-score, also called a standard score, gives you an idea of how far from the mean a data point is; more technically, it is the distance between the raw score and the population mean in units of the standard deviation. So, given the example in the last case, the mean is 54,000 and the standard deviation is 16,000, therefore the income of 73,600 is transformed into 1.225. Last but not least is normalization by decimal scaling. It is relatively more straightforward: the new value is equal to the old value divided by 10 to the jth power, where j is the smallest integer such that the maximum of the absolute new values is smaller than 1. So, for example, given a set of values ranging from negative 772 to 821, we can divide each value by 1,000, and then negative 21 is normalized to negative 0.021, and 444 is normalized to 0.444.

So that's it. And at the end, I want to say that some of this content is based on Data Preprocessing in Data Mining, a book published in 2015, and some of the material is based on the lectures I received from Dr. Yu Wanshan on understanding data and their environment and data pre-processing in 2019.
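As a small preview of next Thursday's coding demonstration, here is a minimal sketch in Python of the normalization and encoding steps just described. The income figures and the value list for decimal scaling follow the examples above; the pet categories are the four from the one-hot example.

```python
import numpy as np
import pandas as pd

# Worked income example: min 12,000, max 98,000, mean 54,000, standard deviation 16,000.
income = 73600.0

min_max = (income - 12000) / (98000 - 12000)       # -> 0.716
z_score = (income - 54000) / 16000                 # -> 1.225
print(round(min_max, 3), round(z_score, 3))

# Decimal scaling: divide by 10**j so the largest absolute value falls below 1.
values = np.array([-772, -21, 444, 821])
j = int(np.ceil(np.log10(np.abs(values).max())))   # j = 3, so divide by 1,000
print(values / 10 ** j)                            # [-0.772 -0.021  0.444  0.821]

# One-hot encoding: replace a categorical feature with binary indicator columns.
pets = pd.DataFrame({"pet": ["cat", "dog", "turtle", "fish"]})
print(pd.get_dummies(pets, columns=["pet"]))
```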