So welcome everyone to data preprocessing. This is the first talk, and today we are going to give an introduction to the concept and cover data integration. My name is Ann Wren and I'm a research associate at the Cathie Marsh Institute and the UK Data Service. I'm based in Manchester and my email is attached here. If you have any questions, you can come to the Q&A session later, or you can send me an email after this talk and I'm very willing to help you.

This is the table of contents. There are two talks and one code demonstration on this topic: one talk today and another next Thursday. Today we are going to cover the definition, the context, collecting data, and data integration, mainly joining tables. Next Thursday we will cover more on data cleaning, data reduction and data transformation. On the Thursday after that we are going to deliver a live-streamed code demonstration on YouTube, where we will provide sample Python code and perform these tasks. If you are interested, you are very welcome to join the next two sessions. Also, if you want to know more about this topic, or about any specific part that I don't mention, we will send a questionnaire after the talk; please leave your comments there to help us improve. And although we cannot hear you during the talk, we will have a Q&A session later and answer questions then. Meanwhile, you are welcome to leave your questions in the Q&A box throughout the talk. My colleagues Julia and Joe will answer them directly there if they are not too complicated, or they might leave them for the Q&A session, so don't worry.

So without further ado, let me give you a definition of data preprocessing. Data preprocessing, also known as data preparation, is the process of manipulating raw data from one or more sources into a structured and clean data set for analysis. It is an important part of data analytics: once the data has been fully prepared, it is ready for further analysis such as classification, association, prediction and clustering. There is no clear recipe or standard for data preprocessing. Usually it includes various tasks and considerations, for example selecting and acquiring the data to use, integrating different data sources, conducting exploratory analysis, and cleaning and repairing data, which means dealing with missing data and inconsistent data, as well as data reduction, transformation, and so on. Generally speaking, it can be regarded as everything you do before the actual modeling.

Harvard Business Review labeled data scientist the sexiest job of the 21st century. It seems like these days everyone wants to be a data scientist, and for good reason: according to LinkedIn, the career has seen exponential growth, becoming the second fastest growing profession. A huge amount of data is generated each day, but many companies are still facing obstacles when it comes to using their data correctly, and some are not even sure what to do with their data science team. Here is a graph of one day in data. You can see that around 500 million tweets are generated each day, and on Facebook around four petabytes of data are generated, including 350 million photos and 100 million hours of video watching time.
So it is fair to say that the popular phrase "big data" is not an exaggeration; we are facing a real data deluge now, which means that the amount of data being generated is overwhelming the capacity of organizations to use it. Organizations all know that raw data itself is not a treasure: only when the data comes with good preparation and proper analytics can it provide actionable knowledge, so that we can get the insights or findings that we want. Companies in the real world, especially the big international ones, put a lot of effort into thinking about how to use data and provide more services around it. For example, you may have heard of Google Trends; some scholars use it to track the change in popularity of something they want to study. Digital banking is another example, where you can find a personalized page or reports within the banking app on your phone. There is also YouTube analytics, which many YouTubers use to see what kind of content makes them lose their audience, so that they can avoid that kind of content next time. And a typical example is Amazon recommendations: when you log into your Amazon account, you always see a lot of automatic recommendations that allure you into making more orders. They are all based on your past searches and purchases. So in some sense, the world nowadays is full of pitfalls. Your data is recorded almost every second; businesses collect the data, analyze it, and then create more strategic presentations or products to attract your attention, and you find yourself spending more time and money on them. Without doubt, this phenomenon is going to become more and more severe in the following decades. Does everyone need to be prepared for this big data change?

Well, it should be borne in mind that although all these applications sound pretty direct and easy, there is a big gap between the expectation and the reality of this work. I think this picture is a really good representation. Real-world data analytics is full of challenges, and some challenges can be very time-consuming or even impossible to overcome. Rather than developing fancy algorithms or plotting eye-catching figures, much of the time and effort is spent on data collection and preparation. Forbes found that data preparation accounts for about 80% of the work of data scientists; in particular, 19% is spent on collecting data sets and 60% is spent on cleaning and organizing data. Other tasks, such as building training sets, mining data for patterns, and refining algorithms, account for very little in comparison. It is not unfair to say that data preparation is crucial for getting meaningful results from data analytics. Not spending enough time and effort on preparing the data leads to garbage in, garbage out, which we also call GIGO. It is a computer-science acronym implying that bad input will result in bad output, which is the message of this picture. It says that if your business is a house, the data is its foundation. Before data cleansing, the data quality might be only 40%, and if you build anything on it you might get wrong predictions, so you don't get any meaningful results or messages from it. However, if you do proper data cleansing, the data quality can improve to 90%. Then you can build your house on it, you can get the right predictions, and you get meaningful insights and findings.
So although data preprocessing may not appear in your research plan or your methodology plan, please don't forget it or underestimate it. It is very crucial in your research and your data analysis.

How do we check data quality? There are many ways to check the quality of your data, and here I'd like to cover two common ones. One is to use metadata. Metadata is data about the data, and the motivation is to better understand the data, including its availability, types, quantity, complexity, and other properties. Another one is exploratory data analysis, or EDA. EDA is widely used in data analytics, and one of its functions is to check data quality. It is an approach to analyzing data sets in order to summarize their main characteristics, often with visual methods. EDA is primarily for seeing what the data can tell us before formal modeling or hypothesis testing. For example, if I have sales data at hand and I want to build a model to predict next month's performance, I might start with EDA and do some quick explorations, such as creating a histogram to see whether any region stands out from the rest, or whether some regions have clearly higher income.

Among other things, the source of the data is crucial in determining the data quality from the very beginning. There are many ways to collect data. If you run your own business, which generates data itself, you might think about building a data warehouse to store the data and organize it as you wish, so the data can be easily retrieved when needed, but that usually costs a lot. Or you can scrape data from a website to collect the raw data yourself, but that usually takes longer to preprocess, as the data tends to be very dirty. For example, if I scrape data from a hotel page, the price may be shown as a piece of text rather than a number, so I need to convert it later. The good point is that with web scraping I can customize the data structure while collecting the data. Alternatively, you can use ready-made data sources, for example Statista, which is usually for business data, the World Bank, and the UKDS, which is the UK Data Service. You can find a lot of open data on the UK Data Service website, including census data, international macro data, qualitative data, and survey data. You can also access data from other providers, use the UKDS API to customize the layout as you want, and then export the data. The data is usually clean, which may save much time on data reduction and integration later, and there are detailed instructions on how to access and use it.

So that's it for the context and for collecting data. Then you may ask again: what do we do in typical data preprocessing once we have the data at hand? Well, major tasks in data preprocessing include data integration, which means integrating multiple databases, data cubes, or files; data description, summarization, and visualization, which is also part of EDA; data cleaning, which means filling in missing values, smoothing noisy values, identifying or removing outliers and noisy data, and resolving inconsistencies; data reduction, which reduces the data representation in volume while producing the same or similar analytical results; data transformation, which is usually normalization and aggregation; and last but not least, data discretization and data generalization, which means categorizing continuous data into different categories so that we can build a model on them.
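To make the metadata and EDA checks mentioned above a bit more concrete, here is a minimal Python sketch using pandas. The file name hotel_sales.csv and the columns price and region are hypothetical stand-ins for whatever data set you are working with, and the clean-up line illustrates the scraped-price-as-text problem described earlier; treat it as a sketch rather than a recipe.

import pandas as pd
import matplotlib.pyplot as plt

# Load a hypothetical data set; replace the file and column names with your own.
df = pd.read_csv("hotel_sales.csv")

# Metadata-style checks: column types, non-null counts, basic summary statistics.
df.info()
print(df.describe(include="all"))
print(df.isna().sum())  # how many values are missing per column

# Scraped prices often arrive as text such as "£120"; coerce them to numbers.
df["price"] = pd.to_numeric(
    df["price"].astype(str).str.replace(r"[^0-9.]", "", regex=True),
    errors="coerce",
)

# EDA: does any region stand out from the rest, e.g. with clearly higher income?
df.groupby("region")["price"].mean().plot(kind="bar", title="Average price by region")
plt.show()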
But it should be borne in mind that, as I said, there is no recipe or standard for data preprocessing at the moment, so there is no fixed order for these steps; these tasks are just the most common ones. In fact, it is quite common that you will have to repeat some of them multiple times. For example, after reading the metadata you clean the data to deal with missing values at the beginning, and then after joining the tables you find that some new missing values have been created, so you need to clean the data again. In the rest of today's session, I'm going to talk about the first task, data integration, and next Thursday I'm going to cover how to clean, reduce, and transform the data.

By definition, data integration is combining data from multiple sources into a unified view. The benefits include improving data quality, enriching data with additional information, and allowing reliable data analytics and beyond. Data integration is to some extent an unavoidable step, as it is not common that all the information will be stored in one spreadsheet. For example, suppose I have a data set of hotel information, including the hotel's name, city, star level and so on, which is fairly static, and another data set of reservations, including the customer's name, hotel name, payment method, price and others, which is constantly updated. If I want to find which features of a hotel determine its popularity, I first need to link the two tables on the hotel name or hotel ID. This is a very typical data integration task.

Usually, integrating in-house data within a data warehouse is relatively straightforward, with common attributes and structures across the schemas. A typical operation is joining tables: as the name indicates, joining tables enables you to extract and simultaneously process data from more than one table. There are four types of join: inner join, full (outer) join, left join, and right join. Note that these methods are common across applications and languages, so SQL and Python may have different joining commands, but you can always select which join type you want to use. You should also remember to recheck data quality after the join, because sometimes new missing values or unexpected rows are created during the joining process; if you do not recheck, the quality of the data might even be reduced by the join.

So the first type is the inner join. By default, a joining query performs an inner join, which includes only matching rows in the result. For example, we have an employee payroll table here and an employee organization table here, and we can join these two tables on the column employee ID. Only the rows that are present in both tables will be kept in the merged data set; if an employee ID appears in only one table, that row will be dropped from the final data set. By contrast, a full outer join includes all rows from both tables, but it can create a lot of missing values. For example, the fifth row in the merged data set has no department value from the second table, so that field is missing; similarly, the last row in the merged data set has no salary value from the first table, so that is another missing value. A left join is a bit different: it includes all rows from the left table, and by left table I mean the table that you put first in your command. Similarly, a right join includes all rows from the right table, which is the table you put in the second place in your command. These two kinds of join are therefore very sensitive to the order of the tables in the command, so we need to be careful about which table we put first.
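As a rough illustration of the four join types, here is a small pandas sketch with made-up payroll and organization tables; the column names and values are invented for the example rather than taken from the slides.

import pandas as pd

payroll = pd.DataFrame({
    "employee_id": [1, 2, 3, 4],
    "salary": [30000, 35000, 28000, 42000],
})
organization = pd.DataFrame({
    "employee_id": [2, 3, 4, 5],
    "department": ["Sales", "HR", "IT", "Finance"],
})

# Inner join: only employee IDs present in both tables are kept.
inner = payroll.merge(organization, on="employee_id", how="inner")

# Full outer join: all rows from both tables; unmatched fields become NaN.
outer = payroll.merge(organization, on="employee_id", how="outer")

# Left join keeps every row of the first (left) table;
# right join keeps every row of the second (right) table.
left = payroll.merge(organization, on="employee_id", how="left")
right = payroll.merge(organization, on="employee_id", how="right")

# Recheck data quality after joining: new missing values may have appeared.
print(outer.isna().sum())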
Data integration is not an easy job, as there are many difficulties. The first is database heterogeneity, including system heterogeneity and schematic heterogeneity. System heterogeneity occurs when different operating systems or hardware platforms are used. Schematic, or structural, heterogeneity covers the cases where the native model or structure used to store the data differs, for example a SQL database versus a NoSQL database. If we want to combine data sets from two different databases, we need to think about how to make them compatible with each other, as they have different structures. The second issue is detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources may differ, including different representations and different scales, for example metric versus British (imperial) units. Last but not least is entity identification. When integrating data, we may need to identify entities across multiple data sources, but the same entity or attribute may have different names in different databases. For example, Bill Clinton and William Clinton actually refer to the same person, but they may have different representations in different databases. We will discuss this later under data linkage.

A very vivid example is a health surveillance system, which is quite related to our situation now. Preventing the outbreak of epidemics requires monitoring the occurrence of unusual patterns of symptoms in real time. Data from many different sources need to be collected, including travel and immigration records, records of doctor, emergency and hospital admissions, drug purchases in pharmacies, animal health data, and so on. Such databases can have different formats and codings, and they usually contain data of different quality. Volume is another issue, as they can contain millions of records. Yet another issue is privacy and confidentiality: if such data are stored and linked at a central location without proper regulation, there might be privacy problems.

Another two parts of data integration are data linkage and data enrichment. They are relatively more complicated and not always needed, so I won't go too deep into them, but I think it's good to know about them in case you encounter them some day. Data linkage is the process of bringing together information from two different records that are believed to belong to the same entity, based on matching variables. It is also known as record linkage, data matching, entity identification, and other names. It is a challenging task if there are errors in the key variables that are used to link the data. There are two kinds of linkage. The first is deterministic linkage: records must agree exactly on the key attributes in order to conclude that they correspond to the same entity. It can be used when high-quality identifiers such as ID numbers are available. The other kind is probabilistic linkage. It does not require that all key variables match; rather, it involves frequency analysis of the data values, which helps to calculate a weight indicating how likely it is that two records refer to the same entity. Sometimes there is an uncommon value agreement, or very strong evidence for a link, which then needs to be taken into consideration manually.

For example, here we have two data sets, and we need to think about which of these records represent the same person. With deterministic linkage, we can use the SSN as the reference; then row 1 in set B will be matched to row 3 in set A because they have the same SSN. However, row 2 in set B will fail to find a match because the SSN is missing in that row. But if we adopt probabilistic linkage, we can consider row 2 in set B a likely match for row 1 in set A, as they share the same date of birth, sex, and zip code, although the names are not exactly the same.
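Here is a minimal sketch contrasting the two linkage strategies on two tiny made-up record sets. The field names, the values and the simple scoring scheme are purely illustrative assumptions; real record linkage would use dedicated tooling and carefully calibrated weights.

import pandas as pd

set_a = pd.DataFrame({
    "ssn": ["123-45-6789", None, "987-65-4321"],
    "name": ["Jane Smith", "John Doe", "J. Doe"],
    "dob": ["1980-01-01", "1975-06-30", "1975-06-30"],
    "zip": ["M13", "M1", "M1"],
})
set_b = pd.DataFrame({
    "ssn": ["987-65-4321", None],
    "name": ["J Doe", "Jon Doe"],
    "dob": ["1975-06-30", "1975-06-30"],
    "zip": ["M1", "M1"],
})

# Deterministic linkage: records must agree exactly on the key (here the SSN),
# so rows with a missing SSN cannot be linked this way.
deterministic = set_a.dropna(subset=["ssn"]).merge(
    set_b.dropna(subset=["ssn"]), on="ssn", how="inner", suffixes=("_a", "_b")
)
print(deterministic)

# Probabilistic linkage (very rough): score agreement on several fields and
# accept pairs whose total weight passes a threshold (here 3).
def match_weight(rec_a, rec_b):
    weight = 2 if rec_a["dob"] == rec_b["dob"] else 0
    weight += 1 if rec_a["zip"] == rec_b["zip"] else 0
    weight += 1 if rec_a["name"].split()[-1] == rec_b["name"].split()[-1] else 0
    return weight

pairs = [
    (i, j, match_weight(a, b))
    for i, a in set_a.iterrows()
    for j, b in set_b.iterrows()
]
likely_matches = [(i, j, w) for i, j, w in pairs if w >= 3]
print(likely_matches)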
Another topic is enriching data. Data enrichment is the process of bringing in more data, which can be disparate data from other internal sources or third-party data from external sources. The benefits are that, first, it improves contextualization with additional data, and second, it helps to enrich or validate our own data. For example, to get a better understanding of advertising effectiveness, a company can enrich its internal sales data with third-party advertisement data. Enriched data is a valuable asset for any organization because it becomes more useful and insightful. A majority of brands conduct data enrichment on their raw data so that they can use it to make informed decisions, and in 2018 data enrichment grew by 80%.

So that's it for today's talk. At last, I want to say that some of the content is based on the book Data Preprocessing in Data Mining, published in 2015. If you are interested in this topic and want to find more material, you are welcome to consult this book.