In 2017, Netflix changed its five-star rating system to a simple thumbs-up, thumbs-down. Now the service was recommending movies based on a match percentage, and people hated it. How can we reduce all the nuance that lives in cinematic art to a primitive, binary reaction? In reality, what Netflix found was that people were giving high ratings to movies they believed were good, not necessarily those they really enjoyed watching. At least, that's what the data said. So, how does data analysis work in organizations like Netflix? And what are the roles on data science teams? This is Gibson Biddle, a former VP and Chief Product Officer at Netflix. Talking about consumer insights, he explained an unexpected customer behavior that led to changing the whole rating system. In shifting to percentage match, Netflix acknowledged that while you may rate a leave-your-brain-at-the-door Adam Sandler comedy only three stars, you enjoy watching it. And as much as you feel good about watching Schindler's List and give it five stars, it doesn't increase your overall enjoyment. Keeping subscribers entertained is kind of critical for Netflix, so they simplified the feedback system to avoid this bias. These customer insights are impressive by themselves, but they wouldn't be possible without two things: a culture that fosters the use of data, and a powerful data infrastructure. In tech jargon, this is called a data-driven organization. You've likely heard this buzz phrase hundreds of times, but what does it really mean? Netflix alone records more than 700 billion events every day, from logins and clicks on movie thumbnails to pausing the video and turning on subtitles. All this data is available to thousands of users inside the organization. One can access it using visualization tools like Tableau or Jupyter, or via a big data portal, an environment that lets users check reports, generate them, or query any information they need.
Then this data is used to make business decisions, from smaller ones, like which thumbnails to show you, to really serious ones, like which shows Netflix should invest in next. But Netflix isn't alone. According to some estimates, about 97% of Fortune 1000 businesses invest in data initiatives, including artificial intelligence and big data. Buzzwords again, but let's have a look at the real data infrastructure technology and the data engineers that make it work. To describe how data infrastructure works, technicians borrowed a term from liquid and gas transportation. Similar to physical pipelines, data pipelines have their own origins, destinations, and intermediate stations, so it's a pretty apt metaphor. The origin of data may be anything: clicks on a reserve button, pull-to-refresh gestures, conversation records with customer support, vehicle tracking devices, turbine vibration sensors at power plants. In today's world, it's actually harder to say what cannot generate data than what can. Even no data can tell us something. Once a data item is generated, it travels down its pipe to a staging area. This is the place where all raw data is kept. Raw data isn't yet ready to be used; it must be prepared. You have to remove the errors from it, fill in the gaps, change its format, or merge data from different sources to get a more nuanced view. As soon as these operations are done, the data, now structured and clean, can continue on its journey. All these operations happen automatically, and they are described in three words: extract, pulling data from its origin and getting it into a staging area; transform, preparing data for use; and load, pushing prepared data further. ETL, for short. All prepared data lands in another storage, a data warehouse. Unlike the staging area, a warehouse is a place where all stored records are structured and prepared for use, just like a library with its classification system.
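To make the extract-transform-load idea concrete, here's a deliberately tiny sketch in Python. Everything in it is invented for illustration: the event fields, the "staging area" (an in-memory CSV), and the "warehouse" (an in-memory SQLite database) are stand-ins for real infrastructure.

```python
import csv
import io
import sqlite3

# --- Extract: raw events arrive in a staging area (here, an in-memory CSV). ---
staging_area = io.StringIO(
    "user_id,event,duration_sec\n"
    "u1,play,3600\n"
    "u2,play,\n"          # a gap: raw data isn't ready to be used yet
    "u1,PAUSE,12\n"       # inconsistent format
)

# --- Transform: remove errors, fill in the gaps, unify the format. ---
rows = []
for row in csv.DictReader(staging_area):
    if not row["duration_sec"]:          # fill the gap with a default
        row["duration_sec"] = "0"
    row["event"] = row["event"].lower()  # normalize the event name
    rows.append((row["user_id"], row["event"], int(row["duration_sec"])))

# --- Load: push the clean, structured data into the warehouse. ---
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE events (user_id TEXT, event TEXT, duration_sec INT)")
warehouse.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

# Now the structured data is queryable, library-style.
total = warehouse.execute("SELECT SUM(duration_sec) FROM events").fetchone()[0]
print(total)  # 3612
```

A real pipeline would run these steps on a schedule or a stream, but the shape — extract, transform, load — is the same.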
Finally, you can query, visualize, and download information from a warehouse. To do that, you need business intelligence, or BI, software. It presents data to end users, data analysts and business analysts, who carry out essential tasks. They access data, explore it, visualize it, and try to make business sense of it. Did our marketing campaign work out well? What's our worst performing channel? They act like a sensory system, supporting an organization with historical data and delivering insights to management and, ultimately, anyone who makes decisions. OK, who's in charge of building this whole pipeline? Traditionally, these specialists are called data engineers: mostly tech people adept at what's known as plumbing, moving data from its origins to destinations across the pipeline and transforming it on the way. They design pipeline architecture, set up ETL processes, configure the warehouse, and connect it with reporting tools. Airbnb, for instance, has about 50 data engineers. Sometimes you might encounter a more granular approach with several extra roles involved. Data quality engineers, for instance, make sure that data is captured and transformed correctly: biased or incorrect data is too expensive when you're trying to derive decisions from it. There may be a separate engineer responsible for ETL only, and also a business intelligence developer focusing solely on integrating reporting and visualization tools. However, reporting tools don't make headlines, and data engineer wasn't called the sexiest job of the 21st century. Machine learning does, and data scientist was. What everybody knows is that data science is particularly good at taking data and answering complex questions about it. How much will the company earn in the next quarter? How soon will your Uber driver arrive? How likely is it that you'll enjoy Schindler's List as much as Uncut Gems? There are actually two ways of answering such questions.
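An analyst's question like "what's our worst performing channel?" ultimately boils down to a query against warehouse tables, whether it's typed by hand or generated by a BI tool. A toy sketch, with an invented table and made-up numbers:

```python
import sqlite3

# A stand-in warehouse table: signups per acquisition channel (invented data).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE signups (channel TEXT, users INT)")
db.executemany(
    "INSERT INTO signups VALUES (?, ?)",
    [("search", 120), ("social", 45), ("email", 80)],
)

# "What's our worst performing channel?" as a warehouse query.
worst = db.execute(
    "SELECT channel FROM signups ORDER BY users ASC LIMIT 1"
).fetchone()[0]
print(worst)  # social
```

BI software adds dashboards, charts, and access control on top, but underneath it's this kind of structured query against structured data.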
Data scientists make use of BI tools and warehouse data, just as business analysts and data analysts do. So they would sit here and get the data from the warehouse. Sometimes data scientists use a data lake instead, another type of storage that keeps unstructured raw data. They'll create a predictive model and produce a forecast to be used by management. That's one-time reporting, and it works for revenue estimates, but it doesn't help with predicting Uber arrival times. The real value of machine learning is production models: those that work automatically and generate answers to complex questions regularly, sometimes thousands of times per second. And things are much more complicated with them. To make a model work, you also need an infrastructure, sometimes a big one. Have a look at this dramatic image. Not dramatic in the way most people understand the word, obviously, but for data scientists, it really is. Notice this tiny box in the middle. Let's zoom in. It says ML code. The image comes from a paper called Hidden Technical Debt in Machine Learning Systems, by Google engineers, and it compares the amount of machine learning code to the rest of the systems that make machine learning code useful. Without them, this tiny box, however brilliant it may be, is just a relatively small piece of code in Python or Java. And it's actually pretty hard to arrive at this model. Data scientists explore data from warehouses and lakes, experiment with it, choose algorithms, and train models to come up with the final ML code. It takes a deep understanding of statistics, databases, machine learning algorithms, and a subject field. In his famous tweet, Josh Wills, former head of data engineering at Slack, said that a data scientist is a person who is better at statistics than any software engineer and better at software engineering than any statistician. What about the rest of those boxes?
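The "tiny box" of ML code really can be tiny. Here is a deliberately minimal model, ordinary least squares fit by hand, of the sort a data scientist might settle on after exploring and experimenting; the training numbers are invented for illustration:

```python
# Toy training data: restaurant distance (km) -> delivery time (min). Invented.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [12.0, 19.0, 26.0, 33.0]

# Fit y = a*x + b with ordinary least squares (closed-form solution).
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x

print(a, b)          # 7.0 5.0 -- the entire "model" is two numbers
print(a * 5.0 + b)   # predicted time for a 5 km delivery: 40.0
```

That's the whole box. Everything around it — data collection, feature storage, serving, monitoring — is what the Google paper calls the hidden technical debt.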
OK, imagine yourself self-isolating and ordering food on Uber Eats. Once you confirm your order, the app must estimate the time of delivery. Your phone sends your location, restaurant, and order data to a server where the delivery prediction ML model is deployed. But this data isn't enough. The model also gets additional data from a separate database that contains, say, the average time for your restaurant to prepare a meal and a wealth of other details. Once all the data is here, the model returns a prediction to you. But the process doesn't stop there. The prediction itself gets saved in a separate database. Your delivery person shows up, and the real time of arrival is also captured to record the ground truth, monitor the model's performance against it, and explore the model via analysis tools to update it later. And all this data will eventually end up in a data lake and a warehouse. In reality, the Uber Eats service alone uses hundreds of different models working simultaneously to score recommendations, rank restaurant search results, and estimate delivery times. At that level of complexity, you also need a clever system to update and retire models, as well as prioritize some models over others to manage computing resources. That's a lot to process. Usually, this job falls on the shoulders of data engineers or machine learning engineers. ML engineers take charge of the production side of things. They aren't as deep into statistics and subject matter as data scientists, but they know how to configure production models, automate extraction of specific data from multiple sources, and verify data quality before use. Finally, if you run machine learning with hundreds of models deployed, you need a data architect to make the work of the whole data platform consistent. This person is responsible for the platform itself and its capabilities rather than for how specific models solve real-life problems.
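The delivery-prediction flow above can be sketched end to end in a few lines. This is a minimal sketch under stated assumptions, not Uber's actual system: the feature store, the prediction formula, and the logging scheme are all invented to show the shape of the loop — enrich the request, predict, log the prediction, then record the ground truth for monitoring.

```python
# Hypothetical "feature store": extra data the model fetches per restaurant.
restaurant_features = {"pizza_place": {"avg_prep_min": 15}}

# Stand-in for a trained production model (the formula is made up).
def predict_delivery_min(distance_km, avg_prep_min):
    return avg_prep_min + 4 * distance_km

prediction_log = []  # predictions are saved to compare against ground truth

def handle_order(restaurant, distance_km):
    feats = restaurant_features[restaurant]        # enrich the request
    eta = predict_delivery_min(distance_km, feats["avg_prep_min"])
    prediction_log.append({"restaurant": restaurant,
                           "predicted_min": eta,
                           "actual_min": None})    # filled in on arrival
    return eta

def record_arrival(order_idx, actual_min):
    # Ground truth lets us monitor model performance and retrain later.
    prediction_log[order_idx]["actual_min"] = actual_min

eta = handle_order("pizza_place", 3.0)  # 15 + 4*3 = 27.0 minutes
record_arrival(0, 31)
error = prediction_log[0]["actual_min"] - prediction_log[0]["predicted_min"]
print(eta, error)  # 27.0 4.0
```

In production, the log would flow into a data lake and warehouse, and a model management system would decide when the accumulated error justifies retraining or retiring the model.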
These six roles are the ones you'll frequently meet today, but things will change in the future. Look at how people imagined our time back in 1982. If you glanced out of a window in 2019, when Blade Runner takes place, you didn't see dystopian architecture, flying cars, or multi-story commercial holograms. In fact, the real future looks like this, or this, or even like this. You can't touch data. You'll have a hard time explaining what data means. But that's what defines the real future we're living in today, and data science and business intelligence will soon be taken for granted. Adam Waksman, head of core technology at Foursquare, believes there won't be data scientists or ML engineers anymore. Since we'll keep automating model training and building production environments, much of the data science work will become a common function inside software development. Thank you for watching. If data is what you deal with every day, tell us more about your work in the comments section below. You may also send meaningful signals to YouTube's machine learning algorithms if you liked the video and want to see more.