So, the sole purpose of data engineering is to take data from the source and save it to make it available for analysis. Frankly, it sounds so simple it's hardly worth talking about. You click on a video and YouTube saves this event in a database. The exciting part is what happens after: how will YouTube use its machine learning magic to recommend other videos to you? But let's rewind a bit. Was it really that simple to put your click into a database? Let's have a look at how data engineering works.

Okay, imagine a team with an application. The application works fine, traffic grows, and sales are coming in. They track results in Google Analytics, the CRM, the application database, and maybe a couple of extra tools they bought to spice up the quarterly PowerPoint. And of course, there's this one quiet guy who's an absolute beast with Excel spreadsheets. Analytics? Covered. At this point, their analytics data pipeline, so to speak, looks like this: several sources of data and a lot of boring manual work to move that data into an Excel spreadsheet.

This gets old pretty fast. First, the amount of data grows every month, along with the appetite for it. Maybe the team adds a couple more sources or data fields to track; there's no such thing as too much data when it comes to analytics. And of course, you have to track dynamics and revisit the same metrics over and over again to see how they change month after month. It's so 90s. The analytics guy's days start to resemble the routine of a person passing bricks one at a time. There's a good quote by Carla Geisser from Google: "If a human operator needs to touch your system during normal operations, you have a bug." So, before the guy burns out, the team decides to automate things. First, they print the quote and stick it on the wall. Then they ask a software engineer for help. And this is the point where data engineering begins.

It starts with automation using an ETL pipeline. The initial goal is to automatically pull data from all the sources and give the analytics guy a break. To extract data, you would normally set up an API connection or another interface to access the data at its sources. Then you have to transform it: remove errors, change formats, map the same types of records to each other, and validate that the data is okay. And finally, you load it into a database, let's say MySQL. Obviously, the process must repeat itself every month or even every week, so the engineer will have to write a script for that. It's still a part-time job for the new data engineer, nothing to write home about. But congratulations, there it is: a simple ETL pipeline. We'll sketch what such a script might look like in a moment.

To access the data, the team would use so-called BI tools, business intelligence interfaces: those great dashboards with pie charts, horizontal and vertical bars, and of course, a map. There's always a map. Normally, BI tools come integrated with popular databases out of the box, and it works great. All those diagrams get populated with fresh data every week to analyze, iterate, improve, and share. Since there's convenient access to insights, a culture of using data flourishes. Everyone can now track the whole sales funnel, from the first visit to a paid subscription. The product team explores customer behavior, and management can check high-level KPIs. It feels like the company has just put on glasses after years of blurriness. The organization starts becoming data-driven. The team can now make decisions, act on them, and see the results via business intelligence interfaces. Actions become meaningful. You can now see how your decisions change the way the company works.
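Here is that sketch: a minimal, illustrative ETL job in Python. The API endpoint, credentials, table, and field names are all made up, and a real script would need proper configuration and error handling.

```python
# A minimal, illustrative ETL job. Endpoint, credentials, and table names are hypothetical.
import requests
import pymysql


def extract():
    # Extract: pull last week's orders from a hypothetical CRM API.
    response = requests.get("https://crm.example.com/api/orders",
                            params={"period": "last_week"})
    response.raise_for_status()
    return response.json()


def transform(raw_orders):
    # Transform: drop broken records, normalize formats.
    cleaned = []
    for order in raw_orders:
        if not order.get("id") or order.get("amount") is None:
            continue  # skip records that fail validation instead of loading them
        cleaned.append((
            order["id"],
            int(round(float(order["amount"]) * 100)),  # store money as cents
            (order.get("country") or "unknown").lower(),
        ))
    return cleaned


def load(rows):
    # Load: insert the cleaned rows into a MySQL reporting table.
    connection = pymysql.connect(host="localhost", user="etl",
                                 password="secret", database="analytics")
    try:
        with connection.cursor() as cursor:
            cursor.executemany(
                "INSERT INTO weekly_orders (order_id, amount_cents, country) "
                "VALUES (%s, %s, %s)",
                rows,
            )
        connection.commit()
    finally:
        connection.close()


if __name__ == "__main__":
    load(transform(extract()))  # scheduled to run weekly, e.g. from cron
```

The whole pipeline is just three functions and a scheduler entry, which is exactly why it starts out as a part-time job.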
And then everything freezes. Reports take minutes to return, some SQL queries get lost, and the current pipeline no longer looks like a viable option. It's so 90s, again. The reason this happens is that the current pipeline uses a standard transactional database. Transactional databases like MySQL are optimized to rapidly fill in tables; they are very resilient and great for running an application's day-to-day operations. But they aren't optimized for analytics jobs and complex queries. At this point, the software engineer must become a full-time data engineer, because the company needs a data warehouse.

Okay, what's a data warehouse? For the team, this is the new place to keep data instead of a standard database: a repository that consolidates data from all sources in a single central place. Now, to centralize this data, you must organize it somehow. Since you're pulling, or ingesting, data from multiple sources, there are multiple types of it: sales reports, your traffic data, insights on demographics from a third-party service. The idea of a warehouse is to structure the incoming data into tables, and the tables into schemas, the relationships between different data types. The data must be structured in a way that's meaningful for analytics purposes, so it will take several iterations and interviews with the team before arriving at the best warehouse design. But the main difference between a warehouse and a database is that a warehouse is specifically optimized to run complex analytical queries, as opposed to the simple transactional queries of a regular database. We'll look at an example of that difference in a moment.

With that out of the way, the data pipeline feels complete and well-rounded. No more lost queries and long processing times. The data is generated at the sources, automatically pulled by ETL scripts, transformed and validated along the way, and finally populates the tables inside the warehouse. Now the team, with access to business intelligence interfaces, can interact with this data and get insights. Great. The data engineer can now focus on improvements and procrastinate a bit, right? Well, until the company decides to hire a data scientist.

So, let's talk about how data scientists and data engineers work together. A data scientist's job is to find hidden insights in data and to build predictive models that forecast the future. And a data warehouse may not be enough for these tasks: it's structured around reporting on metrics that are defined in advance, so the pipeline doesn't process all the data, just those records that the team thought made sense at the moment. Data scientists' tasks are a bit more sophisticated, which means a data engineer has more work to do. A common scenario sounds like this. A product manager shows up and asks a data scientist, "Can you predict the sales for Q3 in Europe this year?" Data scientists never make bold promises, so her response is, "It depends. It depends on whether we can get quality data." Well, guess who's responsible for that now. Besides maintaining and improving the existing pipelines, data engineers commonly design custom pipelines for such one-time requests. They deliver the data to the scientist and call it a day.
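Now, that example of transactional versus analytical queries. This is only an illustration: an in-memory SQLite database stands in for both systems, and the table and column names are invented.

```python
# Illustration only: SQLite stands in for both a transactional database and a warehouse.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, customer_id INTEGER,
                        amount REAL, sale_date TEXT);
    INSERT INTO customers VALUES (1, 'de'), (2, 'fr'), (3, 'de');
    INSERT INTO sales VALUES (10, 1, 19.99, '2024-01-03'),
                             (11, 2, 49.00, '2024-01-04'),
                             (12, 3, 19.99, '2024-02-07');
""")

# A transactional query: touch a single row, the way an application backend does.
one_sale = conn.execute(
    "SELECT amount FROM sales WHERE sale_id = ?", (11,)
).fetchone()

# An analytical query: join and aggregate across whole tables,
# the kind of work a warehouse schema is designed and optimized for.
revenue_by_country_and_month = conn.execute("""
    SELECT c.country,
           substr(s.sale_date, 1, 7) AS month,
           SUM(s.amount)             AS revenue
    FROM sales s
    JOIN customers c ON c.customer_id = s.customer_id
    GROUP BY c.country, month
    ORDER BY month, c.country
""").fetchall()

print(one_sale)
print(revenue_by_country_and_month)
```

On a few rows both queries are instant; on billions of rows, only a storage engine built for scans, joins, and aggregations keeps the second one fast. That is the job a warehouse is hired for.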
Another type of system needed when you work with data scientists is a data lake. Remember that the warehouse stores only structured data aimed at tracking specific metrics? Well, a data lake is the complete opposite. It's another type of storage that keeps all the data raw, without pre-processing it or imposing a defined schema. The pipeline with a data lake may look like this: the ETL process changes into extract, load into the lake, and only then transform, because it's the data scientist who defines how to process the data to make it useful. It's a powerful playground for a data scientist to explore new analytics horizons and build machine learning models. So the job of a data engineer here is to enable a constant supply of information into the lake.

Lakes are artifacts of the big data era, when we have so much diverse and unstructured information that capturing and analyzing it becomes a challenge in itself. So what is big data? Well, it's an outright buzzword used mindlessly everywhere, even when somebody hooks a transactional database up to a BI interface. But there are more concrete criteria that professionals use to describe big data. Maybe you've heard of the four Vs. They stand for volume, obviously; variety, meaning big data can be both structured, aligned with some schema, and unstructured; veracity, meaning the data must be trusted, which requires quality control; and velocity, meaning big data is generated constantly, in real time. So companies dealing with truly big data need a whole data engineering team, or even a big data engineering team. And they wouldn't be running some small application. Think of payment systems that process thousands of transactions simultaneously and must run fraud detection on them, or streaming services like Netflix and YouTube that collect millions of records every second.

Being able to handle big data means approaching the pipeline in a slightly different manner. The pipeline we have so far pulls the data from its sources, processes it with ETL tools, and sends it into the warehouse to be used by analysts and other employees who have access to BI interfaces. Data scientists use the data available in the warehouse, but they also query a data lake with all the raw and unstructured data; their pipeline would be called ELT, because all transformations happen after the data gets loaded into storage. And there's some jungle of custom pipelines for ad hoc tasks. But why doesn't this work for big data that constantly streams into the system?

Let's talk about data streaming. Up to this moment, we've only discussed batch data, which means the system retrieves records on some schedule, every week, every month, or even every hour, via APIs. But what if new data is generated every second and you need to stream it to the analytical systems right away? Data streaming uses a way of communication called pub/sub, or publish and subscribe. A little example here: think of phone calls. When you talk on the phone with someone, you're likely fully occupied by the conversation, and if you're polite, you'll have to wait until the person on the other side finishes their thought before you start talking and responding. This is similar to the way most web communication works over APIs: the system sends a request and waits until the data provider sends a response. This is synchronous communication, and it gets pretty slow when the sources generate thousands of new records and you have multiple sources and multiple data consumers. Now imagine that you use Twitter. Tweets get added to your timeline independently, and you can consume this information at your own pace. You can stop reading for a while and then come back; you'll just have to scroll more. So you control the flow of information, and several sources can supply you with data asynchronously.
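In code, the same idea might look roughly like this. This is a toy, in-memory stand-in for a real message broker, with invented topic and event names, just to show the shape of publish and subscribe.

```python
# A toy in-memory pub/sub broker. Topic and event names are invented for illustration.
import queue
from collections import defaultdict


class Broker:
    """Keeps one queue per subscriber per topic; producers publish, consumers read at their own pace."""

    def __init__(self):
        self.topics = defaultdict(list)  # topic name -> list of subscriber queues

    def subscribe(self, topic):
        q = queue.Queue()
        self.topics[topic].append(q)
        return q

    def publish(self, topic, event):
        # The producer never waits for consumers: it drops the event into every subscriber's queue.
        for q in self.topics[topic]:
            q.put(event)


broker = Broker()
recommender_feed = broker.subscribe("video_clicks")  # e.g. the recommendation system
analytics_feed = broker.subscribe("video_clicks")    # e.g. the analytics pipeline

broker.publish("video_clicks", {"user": 42, "video": "abc123"})
broker.publish("video_clicks", {"user": 7, "video": "xyz789"})

# Each consumer drains its own queue whenever it's ready.
while not recommender_feed.empty():
    print("recommender got:", recommender_feed.get())
while not analytics_feed.empty():
    print("analytics got:", analytics_feed.get())
```

A real broker adds persistence, partitioning, and delivery guarantees on top of this, but the decoupling of producers from consumers is the core of it.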
Pub/sub enables asynchronous conversation between multiple systems that generate a lot of data simultaneously. Similar to Twitter, it decouples data sources from data consumers. The data is divided into different topics, the profiles in our Twitter analogy, and data consumers subscribe to those topics. When a new data record, or event, is generated, it's published to its topic, allowing subscribers to consume the data at their own pace. This way, systems don't have to wait for each other and exchange synchronous messages, and they can deal with thousands of events generated every second. The most popular pub/sub technology is Kafka. Not this Kafka. Yes, this one.

Another approach used in big data is distributed storage and distributed computing. What is distributed computing? You can't store petabytes of data that are generated every second on a laptop, and you most likely won't store them on a single server either. You need several servers, sometimes thousands, combined into what's called a cluster. A common technology used for distributed storage is called Hadoop, which means, well, it actually means nothing; it's just what a two-year-old called his toy elephant. But the boy happened to be the son of Doug Cutting, the creator of Hadoop. So Hadoop is a framework that allows for storing data in clusters. It's very scalable, meaning you can add more and more computers to the cluster as your data gargantua keeps growing. It also has a lot of redundancy for keeping information safe, so even if some computers in the cluster burst into flames, the data won't be lost. And of course, ETL and ELT processes require specific tools to operate on Hadoop clusters. To make the stack feel complete, let's mention Spark, a popular data processing framework capable of this job.

Finally, this is what an advanced pipeline of a company operating big data would look like. You stream thousands of records simultaneously using pub/sub systems like Kafka. This data gets processed with ETL or ELT frameworks like Spark, and then it gets loaded into lakes or warehouses, or travels further down custom pipelines. And all of the data repositories are deployed on clusters of several servers running distributed storage tools like Hadoop. But this isn't nearly the end of the story. Besides data scientists and analytics users, the data can be consumed by other systems, like machine learning algorithms that generate predictions and new data.

So, the sole purpose of data engineering is to take data from the source and save it to make it available for analysis. Sounds simple, but it all comes down to the system that works under the hood. When you click on a YouTube video, this event travels through a jungle of pipelines and is saved in several different storages, some of which will instantly push it further to suggest your next video recommendations using machine learning magic. Speaking of magic, check out our previous video, which has more information about data science and the teams that work with data. Thank you for watching.
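One last, optional illustration of that advanced pipeline: a hedged sketch of a Spark job that transforms raw click events sitting in a Hadoop-backed data lake. It assumes PySpark and an HDFS deployment; the paths and field names are invented.

```python
# Hedged sketch: a PySpark job over raw events in an HDFS-backed data lake.
# Paths and field names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("clicks-daily-aggregate").getOrCreate()

# The raw click events as they landed in the lake (ELT: the transform happens here, after loading).
clicks = spark.read.json("hdfs:///lake/raw/video_clicks/2024-06-01/")

# A simple transformation, computed across the whole cluster: clicks per video.
daily_counts = (
    clicks
    .filter(F.col("video_id").isNotNull())  # basic quality control (the "veracity" V)
    .groupBy("video_id")
    .count()
)

# Write the result to a curated zone where the warehouse or BI tools can pick it up.
daily_counts.write.mode("overwrite").parquet(
    "hdfs:///lake/curated/clicks_per_video/2024-06-01/"
)
```

The same few lines run whether the cluster has three machines or three thousand, which is the whole point of distributed processing frameworks.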