Good morning, everyone. My name is Anand. I work as a data engineer at Slack. My team manages data infrastructure such as Kafka clusters, Spark, Presto, and Flink, and Airflow is one among them. In this talk I'm going to talk about operating data pipelines with Airflow. Though the talk is fine-tuned towards Airflow, it's not Airflow-specific: you can replace Airflow with any other tool, and the data pipeline principles that we followed can be applied to whatever tools you're actually using in your system.

Just to set expectations, this is not an introduction to Airflow or how to use Airflow. This is about what happens once you've deployed Airflow and started to get more users onto your system: how do you scale, and how do you really democratize data pipelines across the organization? This is a war story — the problems that we faced, the solutions that we found, and how we overcame the issues.

So, just out of curiosity, how many people have heard of Slack? Wow. How many people are using Slack? Amazing, thank you. My job is safe.

A brief overview of Slack. We launched publicly in 2014. 65 of the Fortune 100 companies are paid customers of Slack, and more than 500K organizations are currently using Slack. It's one of the fastest-growing enterprise software companies. To give a sense of scale, we have more than 1,000 Slack employees across seven countries, but we have only 12 data engineers. I give that number for a reason: as you go through the talk, you will find out why there are only 12 data engineers and how we optimized our system to scale for a large, fast-growing organization. Right now we have around 8 million active users. Our users stay connected throughout their working day — on average, a user is connected to the system for 10 hours — and that adds complexity to the logging instrumentation and to scaling the infrastructure.

As for data usage: one in two people within Slack accesses our data warehouse system to find information. It's a heavily data-driven decision-making culture, and most people have easy access to the data warehouse and the knowledge we gather. We have around 500-plus tables in our data warehouse, everything produced or orchestrated by Airflow. We're ingesting around 70 million events per minute. For Airflow stats: we currently have around 240-plus active DAGs and we run around 6,000 tasks per day. By task I mean a Spark job, a Hive job, or a Flink job — we run around 6,000-plus of those. And we have 72 contributors: within the Airflow repository, 72 people have actually contributed or created DAGs.

This is the key factor. There are 12 data engineers, so how do you enable people across the organization to contribute and create their own data pipelines? That's the key to the success, and that's one of the beauties of Airflow: it's plain Python — there's no JSON-based or custom domain-specific language. Anyone who knows basic Python can create a data pipeline on their own. That's the success of Airflow, and it helped us a lot in spreading adoption across the organization.

So, the agenda for this talk: I'll briefly go through the Airflow infrastructure, then how we operate Airflow — there were a lot of surprises — how we scale Airflow,
and how we automate pipeline operations — things like alerting and monitoring.

Airflow infrastructure. Before I go into it, I have to ask: how many people are actually using Airflow? Okay. I'll go through a brief overview of Airflow in a nutshell, so that it sets the context for what I'm going to talk about.

In a nutshell, Airflow has three major components. First, the DAG. A DAG is a graph of Airflow operators, or tasks — say you have a Spark job and a Hive job, one job fetching data from tweets and another job munging that data. The logical grouping of those tasks is a DAG.

Tasks are built from operators, and broadly you have three kinds of operators. Sensor operators: let's say I need some files to arrive, and after that I have to run some Spark job. You create a sensor operator, and it senses a signal — it could be a file, it could be an API call, whatever it is. The sensor operator is basically the first check of your data pipeline, waiting for the required input to arrive before your job runs. Then there are action operators, the actual units of execution. It could be a Hive job, a Spark job, or a simple bash operator; if you want to run Python code, you can do that in an action operator. And then there are transfer operators: say you want to pull data from Salesforce to S3, or Salesforce to HDFS. They're basically integration operators. So those are the three kinds.

And then the tasks have to execute somewhere, right? A scheduler has to schedule them and an executor has to execute them. Airflow comes with three broad executors built in. The local executor: whenever a new task starts, it creates a new Python process on your machine. The local executor runs on a single machine — it's not distributed — and it was originally created to test the Airflow scheduler; it was never recommended for production. Whenever it starts executing a sensor, an action, or a transfer operator, it creates a new Python process. The Celery executor uses Celery — no surprise there — as a distributed work manager. And there's also a Mesos executor. A Kubernetes executor is in progress, I think for the next version of Airflow, so Airflow will be tightly integrated with Kubernetes too.

I want to touch briefly on sensor operators and the lifecycle of an operator — how a sensor operator actually works. Let's say you have a daily job, and a particular task depends on a file: the Spark job needs some input files, and you have to wait for them. The sensor operator wakes up at whatever the scheduled time is and just keeps waiting — the Python process keeps waiting for the files to come. When the files arrive, it says, okay, I met the success criterion, my state is success; go and execute the next task. That's how the whole operation works. So this is a simple example of a DAG and how it looks in the Airflow UI.
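In code, a DAG of that shape is just plain Python. Here's a minimal sketch — the task names, bucket path, and schedule are hypothetical, and the sensor import path varies by Airflow version:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.sensors import S3KeySensor  # moved in later versions

dag = DAG(
    dag_id="tweet_analysis",            # hypothetical pipeline
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

# Sensor operator: wait for the day's input to land before doing any work.
wait_for_tweets = S3KeySensor(
    task_id="wait_for_tweets",
    bucket_key="s3://tweet-logs/{{ ds }}/_SUCCESS",  # hypothetical path
    dag=dag,
)

# Action operator: the actual unit of execution (a Spark job here).
analyze_tweets = BashOperator(
    task_id="analyze_tweets",
    bash_command="spark-submit analyze_tweets.py --date {{ ds }}",
    dag=dag,
)

wait_for_tweets >> analyze_tweets  # wait first, then analyze
```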
In the UI you can see there's a fetch task, and then a series of tasks that need to run after it to analyze the tweets. If you're familiar with enterprise integration patterns, you can pipeline this whatever way you want — you can customize your DAGs however you like, because at the end of the day it's just Python code. You have the flexibility of a high-level language; it's highly customizable.

Now, the Airflow infrastructure at Slack. Surprisingly, we use the local executor. We don't use Celery or any other distributed executor. The reason is — you can see the scale we operate at — we were able to scale the local executor, and I'll tell you how. The fundamental philosophy behind it is this: we're in the cloud era, the AWS era. Your CPU and memory are very cheap; your code isn't — engineering effort is expensive. The moment you introduce Celery or any other executor system, you add more complexity on top of whatever you already have. So this was an attempt to say: there is a single box — how far can we scale it? How can we burn more CPU and memory and minimize our maintenance cost by not adding any extra components?

Then, tarball-based deployment: continuous deployment with Jenkins. The way the process works is that anyone can create a DAG and contribute to the GitHub repository — everyone at Slack has access to every code base. If somebody creates a DAG, they post the code review in a Slack channel, and when the review passes, the code gets merged into master; Jenkins picks it up, creates a tarball, and deploys it onto the Airflow box. So the moment you push the code, your DAG is live. It's a genuinely continuous integration process — on any given day we deploy 70 or 80 times, with new DAG changes going into the system.

We also created something called airflow.sh, a small utility script that installs the Airflow ecosystem on your local machine. It gives our users great flexibility, because they don't need to push code to the server to test it: they can test how the code behaves on their local machine before they actually push it to production. Utilities like these are key to adoption — you have to minimize the barrier to entry so that you get more people actively contributing to the system.

It didn't all go smoothly, though; we ran into problems. At certain times we hit a lot of issues with Airflow, and people started to complain: why is my task not running? Is there a deadlock going on? Airflow is not scheduling any tasks — is this a system we can rely on? Should we really move to the Celery executor? How do we scale? We had hit a bottleneck, so we started to look at what was really happening. This is a CPU-idle graph from the Airflow box. You can see that at 16:00 — which is midnight UTC, when all of our daily jobs start to run — the CPU takes a very heavy hit. The CPU stays under heavy load and the box isn't able to schedule any new tasks, because the nature of the local executor is that whenever a new task starts, it creates a new Python process.
And you can only spin off so many processes on a single box, right? Why was this happening? We take MySQL backups — MySQL is our primary database. When I checked last, we had around 900 shards of MySQL instances. We take a snapshot of every MySQL instance, back it up every day, and do a restore to verify that we can actually restore the system. That is a very long process — it takes at least six to eight hours, running on a large in-house Go-based system. Meanwhile all the daily jobs' sensors wake up at the same UTC time and just keep occupying process slots. They're not running anything; they're not doing a thing except waiting for the MySQL files to arrive. That's when the bottleneck appeared, and we had to figure out what to do about it. That's the explanation: the local executor spawns a new Python interpreter per task, as I explained before.

So what's the solution? It's actually a very simple hack. The lifecycle of a sensor operator is very narrow: it wakes up, waits for the signal file, and once the file arrives, it dies. There's no retry in that lifecycle. So we extended the Airflow sensor operator and created a retriable sensor. What the retriable sensor says is: wake up, check whether the files are there, and if not, yield to other processes — basically, die, and then retry after two hours. It's maybe ten lines of Python, nothing fancy.

You can see the change here: this is the non-retriable sensor load, and this is the retriable sensor load — you can see the nice wave pattern. What's happening is that all the sensor operators start to yield, giving their process slots to other processes. That solved one of our biggest problems: we don't need the Celery executor, because we're able to run all the tasks on a single box, which removes a lot of operational overhead for us. So that's scaling Airflow — a very simple hack that saved us a lot of bandwidth.
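A minimal sketch of that retriable-sensor idea — the class name, bucket path, and retry cadence here are hypothetical (the real change was a similarly small extension), and the import paths vary by Airflow version:

```python
from datetime import timedelta
from airflow.exceptions import AirflowException
from airflow.operators.sensors import S3KeySensor  # path varies by version

class RetriableS3KeySensor(S3KeySensor):
    """Poke exactly once. If the data isn't there yet, die and free the
    process slot instead of blocking it, and let Airflow's normal retry
    machinery schedule the next attempt."""

    def execute(self, context):
        if not self.poke(context):
            raise AirflowException("Success criteria not met yet; retrying.")

# Usage: retry every two hours instead of holding a slot for 6-8 hours.
# wait_for_backup = RetriableS3KeySensor(
#     task_id="wait_for_mysql_backup",                       # hypothetical
#     bucket_key="s3://db-backups/mysql/{{ ds }}/_SUCCESS",  # hypothetical
#     retries=12,
#     retry_delay=timedelta(hours=2),
#     dag=dag,
# )
```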
Next, pipeline operations. When you have 240-plus DAGs and 6,000 tasks, your pipeline is already very complex. Here are some Airflow philosophies — I should say data pipeline philosophies, because they're not specific to Airflow.

"Upstream task success is reliable." It is not. Say you have an upstream task — a Spark job that calculates the number of customers — and a team ID is missing from the data. What happened? The job succeeded, but there's missing data in there. Is that really success? It is not. Even if the state says success, it's not reliable until you verify it.

"A task remains static after reaching the success state." It does not, as I said earlier. When you find out a file was missing, you want to run the task again — and you often find out three or four downstream jobs later. The Spark job created the customer stats, then you build a report from those stats, and then you notice, hey, why does this graph look wrong? Something's off. You trace it backward and find out: oh, this job is wrong. Now you need to re-run it, and that means you have to re-run the entire pipeline — not only that particular task, you have to backfill everything downstream.

"The DAG structure is static." It is not. Business is dynamic. Today it's customer stats, tomorrow it's enterprise stats or customer attribution, so you'll add tasks and remove tasks. Whatever task is needed today may not be necessary tomorrow. The DAG structure always changes.

And: data quality is not part of the task lifecycle. I think this is a serious mistake. If you build a data pipeline in any system and you don't account for data quality as part of the task lifecycle, you're asking for trouble, and the price you'll pay is very high — you will find the problem eventually, that's not in question, but the cost to repair it, re-running the whole pipeline again and again, is high both in human effort and in cost to the company. Data quality is part of the task lifecycle.

So how do we fix this? We built an in-house tool called Mario. What does Mario do? Airflow is very good, but it gives you visibility only within a particular DAG. If you're running more than 240 DAGs, your tasks are already tightly coupled — task dependencies cut across multiple DAGs. So we built Mario to give visibility across the entire data pipeline. You can ask Mario: hey, show me my downstream jobs. Say I found an issue with the customer stats and I fixed it — I fixed the code and pushed it, but now I need to re-run all the downstream jobs; show me what my downstream jobs are. And many of those DAGs are owned by different people — DAGs can be owned across different teams — so you have to notify them: hey, this is happening and we're going to run it again. Mario gives end-to-end visibility of the entire data pipeline, which is a very powerful utility for our users to make sense of what's happening.

We also created a series of our own operators. This is one of the beauties of Airflow: it's highly extensible. You can create your own operators, your own executors, your own sensors. So we created a number of Airflow operators customized for our needs, and they're very simple to create. One is a Hive partition sensor operator: all our files are in S3, and we register Hive partitions in the Hive metastore. S3 is famously known for its eventual consistency issues — I'm not sure how many people have burned their fingers on that — so we have to make sure the data and the metadata both exist and are aligned across the different systems before we run the task. The Hive partition sensor verifies the state of a table across those systems.

And this is important: the DQ check. We created a DQ check operator, so anyone who creates a task can include a data quality check. It's a simple SQL statement that runs once the table is created. A very simple example on, say, dw.my_table: the key has to be unique, and count(1) should be greater than zero.
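A minimal sketch of what such a check might look like as a task. Slack's actual DQ check operator is internal, so this uses Airflow's stock CheckOperator instead, which fails the task when any value in the query's first result row is falsy; the table, key column, and connection are hypothetical:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.check_operator import CheckOperator

dag = DAG("dq_example", start_date=datetime(2019, 1, 1),
          schedule_interval="@daily")

# Fails the task (and therefore blocks downstream tasks) unless every
# value in the first result row is truthy.
dq_check_my_table = CheckOperator(
    task_id="dq_check_my_table",
    conn_id="presto_default",        # assumption: checks run via Presto/Hive
    sql="""
        SELECT count(1) > 0                        -- table has rows at all
           AND count(DISTINCT id) = count(1)       -- the key is unique
        FROM dw.my_table
    """,
    dag=dag,
)
```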
That very simple check says: I created a table — does it have any records at all? It just verifies that count(1) is greater than zero. That's only an example; a lot of much more complex SQL gets written — checking whether a team ID is missing, whether a user ID is missing, whether billing information is missing, all sorts of things. Once the task completes, the DQ check runs.

Then there's delete-DAG. You should be able to delete a DAG, so we created a utility that removes a DAG and its state from the system.

This next part is also important, because now we have 72-plus contributors all writing DAGs, and not everyone is a data engineer — we have people on the sales team who wrote DAGs just for fun. So how do you maintain the sanity of the system? That's a key part. We created a number of pytest validators: once code is pushed to your branch, all the validators run, and only when every verification passes are you allowed to merge to master.

Some of the validators we built: external task sensors — you may depend on a task that belongs to some other DAG, and somebody removed that task. What happens to your task? You're waiting for a task that no longer exists in the system. That's wrong, so we catch it before you push the code. Circular dependencies — two DAGs can depend on each other, creating a circular dependency, and then neither of them will run, so you have to be very careful about that. And priority weight — this is a key one. We have some tables that are high priority. For instance, billing stats: say every day at 9 a.m. we have to send a report to the CEO, which means we have to maintain a strong SLA — a service-level agreement that the table will land by, say, 7 a.m. In that case many tasks run in parallel to produce that table, and if your billing-stats task has a priority weight of 10, it makes no sense for an upstream task to have a priority of two. Your upstream tasks have to have the same priority or higher, so we check that.

Also, if you have a high-priority task, you have to retry on failure, and you should have an SLA for it — and the SLA timing should make sense. If your upstream task has a 10 a.m. SLA and five jobs down the chain you set the same 10 a.m. SLA, it makes no sense; it's literally impossible to meet. So we check whether the SLAs are meaningful, and whether retries, success callbacks — all the goodies — are in place. And if yours is a production task, it is mandatory to have a DQ check. It's okay to experiment and do whatever you want, but the moment you mark a task as production, a DQ check is required. So those are the checks that we do — a sketch of what a couple of them might look like follows.
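A minimal sketch of CI-time validators in pytest style. The real checks are internal; these two are simplified stand-ins, and the flat priority comparison is an assumption (Airflow's default weight_rule already accumulates downstream weights):

```python
from airflow.models import DagBag

# Load every DAG the same way the scheduler would.
dag_bag = DagBag(dag_folder="dags/", include_examples=False)

def test_dags_import_cleanly():
    # Catches broken references before merge, e.g. a DAG importing a
    # task module that somebody deleted.
    assert not dag_bag.import_errors

def test_upstream_priority_not_lower_than_downstream():
    # A priority-10 billing task makes no sense if the tasks gating it
    # have priority 2.
    for dag in dag_bag.dags.values():
        for task in dag.tasks:
            for upstream in task.upstream_list:
                assert upstream.priority_weight >= task.priority_weight
```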
Why do we run all those checks? As I mentioned, with 12 data engineers it's simply not possible for us to write every data pipeline ourselves. We have to empower people to do it. And when you empower people, you should focus on building tools that make them feel confident pushing code to production — because strong validation checks are in place, they feel comfortable pushing. All the validator checks are an effort to democratize the data platform.

Finally, alerting and monitoring. So we have the system, and we've empowered people to build pipelines — we also need to alert and monitor when something goes wrong. Failure is inevitable; everything that can go wrong will go wrong in a production system. So how do we catch it, how do we alert the people who need to know, and how do we fix it easily? When you have 72 contributors, not everyone will be watching the system, ready to fix things the moment they break. At the end of the day, the data team is responsible for fixing production issues — it doesn't matter who wrote the code. If production is failing, there's no point pointing fingers and saying, hey, it's your problem, you fix it. The system is in crisis; you fix it first and think about what's next afterward. So maintaining a strong alerting and monitoring system is key in any system.

What principles did we settle on when we started thinking about what alerting and monitoring should look like? There are four. Alerting should be reliable: when we alert, something bad really has happened. Alerts should be actionable: whoever is monitoring the system at the time should be able to fix the problem, or at least get a good handle on what's happening — an alert should carry good information, not leave you thinking, what is actually going on here? Alert when it really matters: there's no point paging someone at 3 a.m. for a task that isn't production critical. If you wake someone at 3 a.m., they lose their sleep and their productivity for the whole next day, and the cost to the company is high — so we're very careful not to wake people up unless it's truly necessary; it's frustrating for people and it destroys their focus. And suppress repeated alerts: when something is already firing, don't bombard people with more and more alerts and create more pressure — suppress the repeats and keep the signal meaningful.

Airflow has a success callback: if a task succeeds, you get a callback, and you can extend it to do whatever you want. It also has a failure callback, and so on. We built an on-call alert callback on top of that, where you can attach all the information: channels, priority, the escalation chain, a note about what it means, the runbook where the fix is documented, the PagerDuty URL. We also integrated all of this into the Airflow UI — runbooks, the on-call dashboard, the production dashboard — so anyone can look at the dashboard and know what's currently happening in the data pipeline.
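A minimal sketch of such a callback — notify_oncall() is a hypothetical stand-in for the real pager/Slack-channel integration, and the job, SLA, and URL are illustrative:

```python
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

def notify_oncall(**info):
    """Hypothetical stand-in for the real pager/alerts-channel integration."""
    print(info)

def on_call_alert(context):
    ti = context["task_instance"]
    notify_oncall(
        channel="#pipeline-alerts",                        # hypothetical
        priority="high",
        note="Core pipeline: no downstream task runs until this succeeds.",
        runbook="https://wiki.internal/runbooks/billing",  # hypothetical
        task_id=ti.task_id,
        execution_date=str(context["execution_date"]),
    )

dag = DAG("billing", start_date=datetime(2019, 1, 1),
          schedule_interval="@daily")

billing_stats = BashOperator(
    task_id="billing_stats",
    bash_command="spark-submit billing_stats.py {{ ds }}",  # hypothetical job
    retries=2,                       # high-priority tasks must retry
    sla=timedelta(hours=7),          # the report has to land by 7 a.m.
    on_failure_callback=on_call_alert,
    dag=dag,
)
```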
Here's what a sample alert looks like. It says the SLA was missed, and the note says: this is a core pipeline, no downstream task will run until it succeeds. It conveys the importance — something is seriously wrong — along with the escalation chain, who to contact, the priority, and the playbook for how to fix it. This one is a sample DQ check alert: it's a warning, and it clearly says this is not blocking — it can happen, but it's good for us to know; don't do anything now, we can investigate later. And we also have a failure alert that says: hey, it failed on the row count, something is wrong, no downstream job will run, this is production critical, go and fix it.

With all these improvements in place, this is a conversation that happened in our alerts channel. The quotes here speak to the amount of work we put into the platform to standardize and improve it as a whole.

So, in summary. As I mentioned: keep your infrastructure simple. If you can scale with a single machine, go for it. If you're trying Airflow for the first time, use the local executor — it can scale; at least we were able to scale it, no problem. The second thing is very important: as a platform engineer, you don't own the platform. The users own the platform. The moment you realize that is when the true democratization of the platform happens — that's when a lot of people actively engage with your platform, because they own it. Your job is to make their lives a lot easier. And automate and standardize wherever possible. So yeah, that's it. Thank you. We have five minutes for questions.

Hello, can you hear me? Yeah. Have you ever had to manage data provenance — versioning of your data? Say you had to re-run a DAG on an old version of some data. Have you ever done that? Yeah. So how do you manage that?

That's an entire talk on its own, data versioning, but I can give you a nutshell. All of our logging infrastructure uses Thrift. I cannot repeat this enough: if you're using JSON or any other loosely typed format for your logging infrastructure, you're making a serious mistake. Don't do that. Binary protocols like Thrift and Avro have built-in backward and forward compatibility, and that gives you a lot of versioning support. Then there's the whole data engineering concept of snapshot-based isolation. The traditional approaches — what the Kimball methodology with its Type 1 and Type 2 dimensions is all about — come from an era of storage scarcity. That's not the case anymore; storage is very cheap. With snapshot-based isolation you can write N versions of snapshots into a particular table, and you can always create a view that points at the latest version, with the previous versions still there.

Hi. I just wanted to understand how you overcame the S3 eventual consistency issue. Yeah, so there are two things. We use EMR — Elastic MapReduce — and EMR's built-in S3 filesystem can integrate with DynamoDB. What happens is that when you write your files to S3, EMR also writes the metadata to DynamoDB. And S3 isn't really eventually consistent across the board: the list operation is the problem; you get read-your-own-writes consistency. When you do a list operation over a whole bucket, that's when the trouble comes in. So what EMR does is add every file it writes to DynamoDB, and when you do a list, it doesn't call the S3 API — it calls the DynamoDB API and gets all the files. Now you have a pointed query, so you can get consistent results.
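That DynamoDB-backed listing is what EMR calls "consistent view". A minimal sketch of enabling it — the emrfs-site property names are the documented ones, while the cluster parameters are hypothetical:

```python
import boto3

# The emrfs-site properties behind EMR's "consistent view": listings go
# through a DynamoDB metadata table instead of the S3 LIST API.
emrfs_consistent_view = [
    {
        "Classification": "emrfs-site",
        "Properties": {
            "fs.s3.consistent": "true",
            "fs.s3.consistent.metadata.tableName": "EmrFSMetadata",
        },
    }
]

emr = boto3.client("emr")
# Passed at cluster-creation time (cluster details are hypothetical):
# emr.run_job_flow(
#     Name="etl-cluster",
#     ReleaseLabel="emr-5.23.0",
#     Instances={...},  # elided
#     Configurations=emrfs_consistent_view,
# )
```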
Hi. I had a question about the sensor operator. You said it waits for files to arrive before passing control on — so I wanted to ask: can you configure it to wait on multiple files in a particular directory, and on the size of the files? Because sometimes you pull data and the job fails — or succeeds without pulling the entire data set, say when consuming from Kafka. Can you do that with that operator? Yeah, totally. You can customize a sensor operator any way you want. There's a concept of hooks in Airflow — HTTP hooks, for example — so you can write your own adapter for that. It's very much possible, and it's very simple to do. Thank you.

Hi. There are a lot of frameworks to manage data pipelines. What business case does this particular framework solve? Sorry — the question is, what does Airflow bring to the table? Yeah, what kind of business case does it solve? There are a lot of frameworks — for distributed computing, for Kafka you have a single console where you can upload your jobs, and so on. So what does this particular framework solve, in a business sense?

So, let's assume you want to build statistics about your customers. All the events you collect land in Kafka, and you want to find out how many active users you have, so you write a Spark job for that. It's usually not one simple Spark job — you end up writing multiple stages of jobs. What Airflow brings to the table is an orchestration engine: how do you couple all the individually dependent jobs? One Spark job doesn't know about the next Spark job, and all the data you collected from Kafka landed yesterday — how does the Spark job know the data has actually landed and it's time to run? Airflow provides the data pipeline orchestration engine. That's what it is.

Can you talk about a few use cases — and whether having a small team of 12 people has been restrictive in terms of serving business use cases? Sorry? Just a few examples of use cases you've implemented on Airflow. Okay. If you're using Slack and you get billing info — okay, you have to pay this much — some job running on Airflow produced that report. Slack follows the fair billing policy, which means we charge only for active users, and the definition of active users is actually calculated by one of the Airflow tasks. The email we send to users is also driven by a Spark job. And sales predictions — which potential customer is going to convert, and how can we reach them — that's an Airflow task too. So you can imagine it across sales, marketing, billing, all the predictive analysis.
Even search: if you go to Slack and search for something, there's an Airflow task that runs every week and builds the whole Solr index offline. So Airflow is used as the core engine for all these batch pipeline operations — customer statistics, billing statistics, offline search index creation, and performance analysis: for this API call, what does the performance look like, how does it compare over time, how do we improve it, and so on. All the statistics behind the business operations are generated on top of Airflow.

A question on scaling. You said initially that you were able to scale in local mode. First, how do you handle scaling for jobs once you run out of CPU throughput? And second, in the case of failures, you said DAGs aren't versioned — if you want to return to a point in time, how do you do that, especially when the DAGs have all changed in the meantime?

Sorry, I missed it — can you repeat the first question? How do you handle scaling in local mode? You're running everything in local mode, so how do you scale your infrastructure to that level? Yeah, so we use an m4.10xlarge machine, a very high-CPU AWS instance. We have a bunch of reserved instances for other use cases, so we get it very cheap — it's not very costly. So we scale up in terms of the box: it has a lot of cores, so it can run many processes at the same time. And yes, you can still scale out in Airflow with local executors. The way Airflow works, the scheduler has a DAGs folder — in the configuration file you say, this is my DAGs folder — and every Python file inside that folder defines DAGs. Yeah, probably I can take this offline. In the interest of time, let's take it offline. Thank you. Thanks.