In 2014, Amazon started working on its experimental ML-driven recruitment tool. Similar to the Amazon rating system, the hiring tool was supposed to give job applicants scores ranging from 1 to 5 stars when screening resumes for the best candidates. Yeah, the idea was great. But it seemed that the machine learning model only liked men. It penalized all resumes containing the word "women's," as in "women's softball team captain." In 2018, Reuters broke the news that Amazon had eventually shut down the project.

Now the million-dollar question: how come Amazon's machine learning model turned out to be sexist? A. AI goes rogue. B. Inexperienced data scientists. C. Faulty dataset. D. Alexa gets jealous. The correct answer is C, faulty dataset. Not exclusively, of course, but data is one of the main factors determining whether ML projects succeed or fail. In the case of Amazon, the models were trained on 10 years' worth of resumes submitted to the company, for the most part, by men.

So here's another million-dollar question: how is data prepared for machine learning? All the magic begins with planning and formulating the problem that needs to be solved with the help of machine learning, pretty much the same as with any other business decision. Then you start constructing a training dataset and stumble on the first rock: how much data is enough to train a good model? Just a couple of samples? Thousands of them? Or even more? The thing is, there's no one-size-fits-all formula to calculate the right dataset size for a machine learning model. Many factors play a role here, from the problem you want to address to the learning algorithm you apply within the model. The simple rule of thumb is to collect as much data as possible, because it's difficult to predict which and how many data samples will bring the most value. In simple words, there should be a lot of training data. Well, "a lot" sounds a bit too vague, right?
Here are a couple of real-life examples for a better understanding. You know Gmail from Google, right? Its Smart Reply suggestions save time for users, generating short email responses right away. To make that happen, the Google team collected and pre-processed a training set that consisted of 238 million sample messages, with and without responses. As for Google Translate, it took trillions of examples for the whole project. But that doesn't mean you also need to strive for such huge numbers. I-Cheng Yeh, a Tamkang University professor, used a dataset consisting of only 630 data samples. With them, he successfully trained a neural network model to accurately predict the compressive strength of high-performance concrete. As you can see, the size of the training data depends on the complexity of the project in the first place.

At the same time, it's not only the size of the dataset that matters, but also its quality. What can be considered quality data? The good old principle, garbage in, garbage out, states that a machine learns exactly what it's taught. Feed your model inaccurate or poor-quality data, and no matter how great the model is, how experienced your data scientists are, or how much money you spend on the project, you won't get any decent results. Remember Amazon? That's what we're talking about.

Okay, it seems the solution to the problem is kind of obvious: avoid the garbage-in part and you're golden. But it's not that easy. Say you need to forecast turkey sales during the Thanksgiving holidays in the U.S., but the historical data you're about to train your model on covers only Canada. You may think, Thanksgiving here, Thanksgiving there, what's the difference? To start with, Canadians don't make that big of a fuss about turkey; the bird suffers an embarrassing loss in the battle with pumpkin pies. Also, the holiday isn't observed nationwide, not to mention that Canada celebrates Thanksgiving in October, not November.
Chances are, such data is just inadequate for the U.S. market. This example shows how important it is to ensure not only the high quality of data, but also its relevance to the task at hand. Then the selected data has to be transformed into the most digestible form for a model, so you need data preparation.

For instance, in supervised machine learning, you inevitably go through a process called labeling. This means you show a model the correct answers to the given problem by leaving corresponding labels within a dataset. Labeling can be compared to how you teach a kid what apples look like. First, you show pictures and say that these are, well, apples. Then you repeat the procedure. When the kid has seen enough pictures of different apples, the kid will be able to distinguish apples from other kinds of fruit.

Okay, what if it's not a kid that needs to detect apples in pictures, but a machine? The model needs some measurable characteristics that describe the data to it. Such characteristics are called features. In the case of apples, the features that differentiate apples from other fruit in images are their shape, color, and texture, to name a few. Just like the kid, when the model has seen enough examples of the features it needs to predict from, it can apply the learned patterns and decide on new data inputs on its own.

When it comes to images, humans must label them manually for the machine to learn from. Of course, there are some tricks, like what Google does with its reCAPTCHA. Yeah, just so you know, you've been helping Google build its database for years, every time you proved you weren't a robot. But labels may already be available in the data. For instance, if you're building a model to predict whether a person is going to repay a loan, you'd have the loan repayment and bankruptcy history. Anyway, it's all cool and easy in an ideal world. In practice, there may be issues like mislabeled data samples. Getting back to our apple recognition example.
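To make the features-plus-label idea concrete, here's a minimal sketch in Python with pandas. The fruit table, its column names, and the values are all hypothetical, invented purely for illustration: each row describes one image by its measurable features, and the label column holds the "correct answer" a human annotator assigned.

```python
import pandas as pd

# Hypothetical labeled training set: each row describes one fruit image
# by measurable features (shape, color, texture); the "label" column is
# the correct answer a human annotator provided.
samples = pd.DataFrame({
    "shape":   ["round", "round", "round"],
    "color":   ["red", "green", "orange"],
    "texture": ["smooth", "smooth", "fuzzy"],
    "label":   ["apple", "apple", "peach"],
})

features = samples.drop(columns=["label"])  # what the model sees
labels = samples["label"]                   # what it learns to predict

print(features)
print(labels.tolist())
```

During training, the model is shown the feature columns and the matching labels together; at prediction time, it receives only the features.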
Well, you see that a third of the training images show peaches marked as apples. If you leave it like that, the model will think that peaches are apples too. And that's not the result you're looking for, so it makes sense to have several people double-check or cross-label the dataset.

Of course, labeling isn't the only procedure needed when preparing data for machine learning. Among the most crucial data preparation processes are data reduction and cleansing. Wait, what? Reduce data? Clean it? Shouldn't we collect all the data possible? Well, you do need to collect all possible data, but that doesn't mean every piece of it carries value for your machine learning project, so you do the reduction to put only relevant data into your model.

Picture this: you work for a hotel and want to build an ML model to forecast customer demand for twin and single rooms this year. You have a huge dataset with different variables, like customer demographics and information on how many times each customer booked a particular hotel room last year. What you see here is just a tiny piece of a spreadsheet. In reality, there may be thousands of columns and rows. Let's imagine that the columns are dimensions in a 100-dimensional space, with rows of data as points within that space. That's hard to visualize, since we're used to three spatial dimensions, but each column really is a separate dimension here, and it's also a feature fed as input to a model. The thing is, when the number of dimensions is too big and some of them aren't very useful, the performance of machine learning algorithms can decrease. Logically, you need to reduce the number, right? That's what dimensionality reduction is about. For example, you can completely remove features that have zero or close-to-zero variance, like the country feature in our table. Since all customers come from the U.S., the presence of this feature won't make much impact on prediction accuracy.
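Dropping zero-variance features like the country column can be sketched in a few lines of pandas. The toy booking table and its column names are hypothetical, just a stand-in for the spreadsheet described above.

```python
import pandas as pd

# Hypothetical slice of the hotel-booking dataset: every customer comes
# from the U.S., so the "country" column has zero variance and carries
# no signal for the model.
bookings = pd.DataFrame({
    "age":        [34, 27, 45, 52],
    "country":    ["US", "US", "US", "US"],
    "twin_rooms": [1, 0, 2, 1],
})

# Find and drop every column with a single unique value (zero variance).
constant_cols = [c for c in bookings.columns if bookings[c].nunique() == 1]
reduced = bookings.drop(columns=constant_cols)

print(constant_cols)           # ['country']
print(list(reduced.columns))   # ['age', 'twin_rooms']
```

For near-zero variance or more aggressive reduction, libraries like scikit-learn offer dedicated tools, but the idea is the same: fewer useless dimensions, better-behaved models.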
There's also redundant data, like the year-of-birth feature, as it presents the same info as the age variable. Why use both if one is basically a duplicate of the other?

Another common pre-processing practice is sampling. Often you need to prototype solutions before actual production. If the collected datasets are just too big, they can slow down the training process, as they require larger computational and memory resources and take more time for algorithms to run on. With sampling, you single out just a subset of examples for training instead of using the whole dataset right away, speeding up the exploration and prototyping of solutions. Sampling methods can also be applied to solve the imbalanced data issue, involving datasets where the classes are not equally represented. That's the problem Amazon had when building their tool: the training data was imbalanced, with the prevailing part of resumes submitted by men, making female resumes a minority class. The model would have produced less biased results if it had been trained on a sampled training dataset with a more equal class distribution.

What about cleansing? Datasets are often incomplete, containing empty cells, meaningless records, or question marks instead of necessary values. Not to mention that some data can be corrupted or just inaccurate. That needs to be fixed: it's better to feed a model imputed data than to leave blank spaces for it to speculate on. For example, you fill in missing values with selected constants or with values predicted from other observations in the dataset. As for corrupted or inaccurate data, you simply delete it from the set.

Okay, the data is reduced and cleansed. Here comes another fun part: data wrangling. This means transforming raw data into a form that best describes the underlying problem to a model. The step may include such techniques as formatting and normalization. Well, these words sound too techy, but they aren't that scary.
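Both ideas, imputing missing values and rebalancing classes by sampling, can be sketched with pandas. The loan table below is hypothetical (invented column names and values), and mean imputation plus random upsampling are just two of many possible choices.

```python
import pandas as pd

# Hypothetical loan dataset: missing income values and an imbalanced
# label (more "repaid" than "defaulted" examples).
df = pd.DataFrame({
    "income": [52000, None, 61000, 48000, None, 75000],
    "label":  ["repaid", "repaid", "repaid", "repaid",
               "defaulted", "defaulted"],
})

# Cleansing: impute missing incomes with a value derived from the
# other observations (here, the column mean) instead of leaving blanks.
df["income"] = df["income"].fillna(df["income"].mean())

# Sampling: upsample the minority class (with replacement) so both
# classes are equally represented in the training set.
majority = df[df["label"] == "repaid"]
minority = df[df["label"] == "defaulted"]
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
balanced = pd.concat([majority, minority_up])

print(balanced["label"].value_counts())
```

Downsampling the majority class works too; which direction to sample depends on how much data you can afford to throw away.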
Data combined from multiple sources may not be in a format that fits your machine learning system best. For example, the collected data comes in the .xls file format, but you need it in a plain-text format like .csv, so you perform formatting. In addition to that, you should make all data instances consistent throughout the datasets. Say, a state in one system could be Florida, and in another it could be FL. Pick one and make it the standard.

You may also have different data attributes with numbers on different scales, representing quantities like pounds, dollars, or sales volumes. For example, you need to predict how much turkey people will buy during this year's Thanksgiving holiday. Say your historical data contains two features: the number of turkeys sold and the amount of money received from the sales. But here's the thing: the quantity ranges from 100 to 900 per day, while the amount of money ranges from 1,500 to 13,000. If you leave it like this, some models may consider the money values more important to the prediction simply because they are bigger numbers. To ensure each feature contributes equally to model performance, normalization is applied. It rescales figures to a common range, say 0.0 to 1.0 for the smallest and largest values of a given feature. One of the classical ways to do that is the min-max normalization approach. For example, if we were to normalize the amount of money, the minimum value, 1,500, is transformed into a zero, and the maximum value, 13,000, is transformed into a one. Values in between become decimals: say, $2,700 becomes roughly 0.1 and $7,000 roughly 0.48. You get the idea.

Up until now, we've been talking about working with only those features already present in the data. Sometimes you deal with tasks that require the creation of new features. This is called feature engineering. For instance, we can split complex variables into parts that can be more useful for the model. Say you want to predict customer demand for hotel rooms.
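A sketch of that kind of split, assuming pandas and a hypothetical booking-timestamp column (the timestamps and column names are invented for illustration):

```python
import pandas as pd

# Hypothetical booking timestamps in their native date-time form.
bookings = pd.DataFrame({
    "booked_at": pd.to_datetime([
        "2023-12-24 22:15:00",
        "2024-07-04 09:30:00",
    ])
})

# Feature engineering: decompose the complex date-time variable into
# separate numerical features, so the model can learn seasonal patterns
# (month, day) and daily patterns (hour) independently.
bookings["month"] = bookings["booked_at"].dt.month
bookings["day"]   = bookings["booked_at"].dt.day
bookings["hour"]  = bookings["booked_at"].dt.hour

print(bookings[["month", "day", "hour"]])
```

Each new column is a plain number the model can use directly, instead of one opaque timestamp.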
In your dataset, you have date-time information in its native form that looks like this. You know that demand changes depending on days and months: you have more bookings during holidays and peak seasons. On top of that, your demand fluctuates depending on the specific time of day; say, you have more bookings at night and far fewer in the morning. If that's the case, both the time and the date information have their own predictive powers. To make the model more efficient, you can decompose the date from the time by creating two new numerical features, one for the date and the other for the time.

A machine learning model can only get as smart and accurate as the training data you're feeding it. It can't get biased on its own. It can't get sexist on its own. It can't get anything on its own. And while the unfitting dataset wasn't the only reason the Amazon AI project failed, it still bore the lion's share of the blame. The truth is, there are no flawless datasets, but striving to make them flawless is the key to success. That's why data preparation is such a crucial step in the machine learning process, and that's why it takes up to 80% of a data science project's time. Speaking of projects, more information can be found in our videos about data science teams and data engineering. Thank you for watching.