Hi everyone, thank you for joining me virtually. I'm here to talk about Apache Airflow's deferrable operators and how they can help you save tons of money. Before that, a little bit of background about myself. My name is Kaxil Naik; I'm a committer and PMC member of the Apache Airflow project. I work at Astronomer as the Director of Airflow Engineering, and my main work is improving Airflow, either by contributing to it directly or by building tooling on top of Airflow to improve the DAG authoring experience, which has been my main focus area for the past few months.

So let's dive in with a quick refresher on Airflow for those of you who are not familiar with it. What is Apache Airflow? It is a platform to programmatically author, schedule, and monitor workflows. You author workflows in Python as DAGs; you schedule them using a cron expression, a timedelta object, or, with the newer versions of Airflow, custom timetables; and you monitor workflows using Airflow's UI, the Airflow webserver, where you can check the status of the individual tasks and also of the DAG.

So let's look at this basic example: a very simplistic ETL DAG which extracts some records from somewhere, applies a very simple transformation to the extracted data, and then loads it somewhere. The figure on the right just shows what the dependency graph looks like and what the output of the DAG code is. Simple enough. And that's it, that's Airflow: an orchestration tool where you can write your workflows in a way that lets you version them, run them, and observe them. It has been the de facto standard for the data engineers of the world.

So let's get back to our main topic for today: deferrable operators. That's why we are here. In today's presentation, I will first explain why we even need deferrable operators and what problem they solve, then how they work, and finally how to find the available deferrable operators and use them, and, if you want to build your own, how you can do that. So it will cover at least these three topics.

So, the main question: why do we need deferrable operators? Aren't the standard operators perfect, don't they just work well? Let's answer that. Let's first look at how Airflow runs the tasks you define in your DAG. Like I showed in the figure, you have some DAGs defined and some tasks defined in each DAG. How does Airflow pick them up? The Airflow DAG processor parses your file continuously and stores some metadata about it in the metadata database. The scheduler then creates a DAG run for your DAG and creates task instances for each of the tasks in your DAG. A task is the instantiation of your operator; a task instance says "my task is running at this time". So the scheduler first creates a DAG run for your DAG, then it creates task instances. Now the scheduler checks whether the dependencies are met for each task instance or not. If they are met, it marks the task as queued and sends it to the executor. I have not shown the executor here, but the executor currently runs as part of the scheduler. The executor now takes that queued task and figures out how to run it depending on the type of executor.
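For reference, here is a minimal sketch of what a simplistic ETL DAG like the one in the refresher might look like; the dag_id, schedule, and task bodies are made up for illustration and are not from the slide.

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule_interval="@daily", start_date=datetime(2022, 1, 1), catchup=False)
def simple_etl():
    @task
    def extract():
        # Pretend these records were pulled from an upstream source.
        return [1, 2, 3]

    @task
    def transform(records):
        # A very simple transformation over the extracted data.
        return sum(records)

    @task
    def load(total):
        print(f"Loading result: {total}")

    # extract -> transform -> load, which is the dependency graph shown in the figure.
    load(transform(extract()))


simple_etl()
```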
Airflow by default ships with different types of executors: the Celery executor, the Kubernetes executor, the local executor, the sequential executor, and a few others, but these are the widely used ones. With the Celery executor, the task is first sent from the executor to the broker, which can be Redis or RabbitMQ, which then assigns it to a worker based on how much load is already on each worker. Now the worker runs this task. For example, if the task uses the Bash operator it runs a Bash script, and if it uses the Python operator it just runs Python code. This is a typical operator. There are different kinds of tasks that you can run: one that just runs a script, like the Bash operator or the Python operator we talked about, or one that does CPU-intensive work. You can do intensive work even with the Bash operator, you could run your machine learning model, or you could have a specialized operator that does machine learning model training. In these cases the worker is doing something compute-intensive: it is crunching your data, processing it, using all the available resources, and the CPU is maxed out because you are crunching data with NumPy, Pandas, and everything. We can't do anything about that; your worker and its resources are being utilized perfectly fine. So far so good.

However, there are also different types of tasks. Not all tasks are compute-intensive or memory-intensive like this. Some tasks are like the one shown over here: I don't want to compute stuff on the worker, I want to offload it to an external system that is specialized for it. For example, Dataproc, the managed Spark service on Google Cloud Platform. I would like to submit a Dataproc job from my Airflow worker, and the Dataproc operator will then just poll to see whether the Dataproc job has completed successfully or not. In this scenario, let's say my Dataproc job takes one hour. One particular Airflow worker slot is completely blocked for that hour. It spends, say, the first 30 seconds just submitting the Spark job, and then all it does for the next 40 or 50 minutes is poll and sleep, poll and sleep. And once the job completes, in the last, say, five minutes of that hour, it marks your task as succeeded (or failed), saying everything is done.

So, in the first example, where the task was CPU-intensive, the resources were utilized well. But how are the resources being utilized here? They are wasted, because for most of those 50 or 55 minutes the CPU was idle; all you were doing was sleep and poll, sleep and poll, which is not efficient. And this is a typical operator. Not all operators are like this, but many are, because people want to offload that work to specialized tools and just use Airflow as an orchestrator. Then there is a different kind of operator, a special kind called sensors, whose only job is to wait and poll until a criterion is met. For example: wait until my file, my data, has arrived in S3, and then, based on that, perform a lot of operations on it. Typically, your sensor is the first task of your DAG.
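To make that concrete, here is a rough sketch, not a real provider operator, of what the execute method of such a synchronous submit-and-poll operator typically looks like; the client calls are placeholders standing in for a real SDK such as Dataproc's.

```python
import time

from airflow.models.baseoperator import BaseOperator


class SubmitAndPollOperator(BaseOperator):
    """Hypothetical synchronous operator: submit a job, then block the worker
    slot polling until the external system finishes."""

    def __init__(self, poll_interval: float = 60.0, **kwargs):
        super().__init__(**kwargs)
        self.poll_interval = poll_interval

    def _get_client(self):
        # Placeholder for a real hook/SDK client (e.g. a Dataproc client).
        raise NotImplementedError

    def execute(self, context):
        client = self._get_client()
        job_id = client.submit_job()          # the "real" work: roughly 30 seconds
        # Everything below just occupies the worker slot for the lifetime of the job.
        while True:
            state = client.get_job_state(job_id)
            if state == "SUCCEEDED":
                return job_id
            if state == "FAILED":
                raise RuntimeError(f"Job {job_id} failed")
            time.sleep(self.poll_interval)    # worker slot stays blocked while sleeping
```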
Now, there are companies, and I know some of our customers as well, who have, let's say, 100 DAGs, and the first task in each of them is a sensor. That's 100 tasks which are just waiting for something, and on the worker all of them are just polling. Just imagine: your resources are literally not being utilized well; it's an inefficient use of your resources. You may have added gigabytes to your worker because of a few tasks that are CPU-intensive, but those resources are not being used well. This is my main point. So you can see a pattern with these submit-and-poll operators and with all the sensors: the resources are completely wasted while all we are doing is waiting for a criterion to be met, after the initial compute-intensive part, after doing something and submitting it to an external system. In the first example that means polling for the Spark job; for the S3 sensor it means waiting for the files to arrive. All that a worker slot is doing in that idle period, the shaded one in the figure, is waiting on an external system to complete and send its result back to the worker. And because of that, the worker slot is blocked until your task completes, until your external system finishes all the work, and only then can the next task move onto that particular worker slot.

Now imagine you have tons of tasks like that: tons of sensors, tons of poll-based operators. A GCS sensor, an S3 sensor, a Spark operator, or a query operator where you just submit a query and then wait for it to complete. Let's say you have 16 such tasks; by default, a Celery worker in Airflow has 16 slots. If all your 16 slots are waiting for an external system to tell them whether a job has completed or not, your entire worker is blocked with tasks which are all idle. We need a better solution to this; this is inefficient usage. And this is where deferrable operators, also known as async operators, come into the picture. One of our engineers at Astronomer, my colleague Andrew Godwin, and his team added this feature in Airflow 2.2, and he gave a really nice talk about the technicalities of it at this year's Airflow Summit. My talk is just building on top of it, so I highly recommend watching Andrew's talk from the Airflow Summit; a lot of today's content is based on that.

Let's think about the middle part of this previous example, where polling the Spark cluster was wasting the worker, or the first part, where the sensor was. What if we moved that to a different component? Here you see a new component called the triggerer; that is where we are offloading the idle work. So we move the middle part of the last figure, the part that is waiting for an external event to succeed, to this new component, which, like I mentioned, was introduced in Airflow 2.2. The triggerer component is designed to run exactly that sort of workload: thousands of pieces of asynchronous Python code in a single triggerer process, where most of what that code does is poll and sleep and repeat.
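As a tiny illustration of that idea, here is plain asyncio, not actual triggerer code, showing why thousands of poll-and-sleep workloads can share a single process; the polling condition is a stand-in for a real async API call.

```python
import asyncio
import random


async def poll_until_done(job_id: int) -> int:
    """Stand-in for one trigger: poll, sleep, repeat until a condition is met."""
    while True:
        done = random.random() < 0.1          # placeholder for an async API call
        if done:
            return job_id
        await asyncio.sleep(1)                # yields control so other coroutines can run


async def main():
    # Thousands of these coroutines can share one event loop, because each one
    # spends almost all of its time awaiting the sleep rather than using the CPU.
    results = await asyncio.gather(*(poll_until_done(i) for i in range(1000)))
    print(f"{len(results)} jobs finished")


if __name__ == "__main__":
    asyncio.run(main())
```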
That is the main job of the triggerer. Otherwise you'd think: hey, you're just moving the load from the worker to a different component, how are you actually optimizing anything? I'll explain it in detail, don't worry. This is what it looks like when you use an async operator: the figure now has two components instead of one. We had just the worker in the last slide, and now we also have a triggerer. With our earlier Spark example, the worker submits the Spark job, stores the job ID in Airflow's metadata DB, and defers itself. Don't worry about these terms, I'll explain each of them. With an async operator, instead of submitting the Spark job and then holding on to it on the worker, we retrieve its job ID, store the job ID in Airflow's metadata database, and suspend the execution of the task from the worker, running it on the triggerer instead. The triggerer now picks up that job ID. It says: okay, given this job ID, I'm a specialized trigger, I know how to poll using it. The Spark cluster is then polled again and again until the job is completed, and the trigger fires once the job is completed. And depending on what your trigger returns, it can store the API response in the metadata DB along with which method needs to run next; I'll explain that in more detail as well. The only thing you need to know right now is that instead of submitting the Spark job and continuing to poll on the worker, we are suspending the task from the worker and running the waiting part on the triggerer instead, freeing up the worker slot so that it can run other tasks in that free slot. Once the polling is complete on the triggerer, the task comes back to the worker: the scheduler queues the same task again and brings it back to the worker along with the event payload the trigger saved. So it's like: hey, I'm back on the worker, this is what your API responded, the task completed with such-and-such response, here it is to show you on the UI.

This slide shows the difference between how an async operator runs compared to the non-async one. If you are using the Celery executor, one worker slot would be entirely consumed while running task one, even if it is idling for most of its execution, and only then would it run task two. With async operators, instead, you can use the worker slot efficiently and move the idling part of the execution to the triggerer, so that other tasks can use the same worker slot. This slide shows the same process I just explained: the task runs on the worker, then defers itself, suspends its execution from the worker, and runs on the triggerer until a criterion is met. Then, once that is complete, the execution of the task jumps back to the worker, which completes the execution and sets the state of the task: okay, I completed and failed, or I completed and succeeded.

Now, I've mentioned the triggerer a couple of times and the trigger a couple of times. The triggerer should be easy to understand: it's a new component that runs a different type of workload. The trigger, on the other hand, is a new concept: it's the workload that runs on the triggerer, and it's a different concept from the operator. A trigger is different in that it has to comply with some design choices that we have made for it to run efficiently.
First, it must be written asynchronously, using Python's asyncio module, so that thousands of these triggers can run in the same process. There should be no blocking synchronous calls, like a database call, in the trigger; none at all. If you absolutely need to make a database call, you should use helpers such as sync_to_async to make it asynchronous. The second thing is that a trigger should not store any persistent state. You can't store anything locally in a trigger. It can retrieve some information from the Airflow metadata database when it starts running and send some information back to the database when it finishes its execution on the triggerer, but it should not store any data locally. The reason we have done this is that it gives more reliability to the trigger and the triggerer: if multiple triggerer processes are running for high availability and one of the triggerers dies while it was running some triggers, another triggerer can rerun those triggers. Or, for simplicity, even if there is just one triggerer running four or five different triggers, we can restart that component whenever needed without compromising execution; the new triggerer that spawns up can just fetch from the database which triggers it needs to rerun, and it will rerun them. This basically allows us to shuffle triggers between different triggerers. And lastly, it must support multiple copies of itself. This is for the case where a triggerer loses network connectivity, and hence probably loses its heartbeat; the Kubernetes cluster, if you have one, will think that the triggerer process died and will spawn up a new triggerer, which will run all the triggers that were running on the first one. It makes the setup fault-tolerant. So you should write your trigger in such a way that multiple copies of it can run at the same time.

Now, let's take a look at an example trigger. This is a very simple example of a trigger that waits for a specific date and time. We ship a better-written version of this trigger with Airflow itself, but for ease of understanding, let's take this simplified example. The trigger fires when its condition is met; in this case, it fires when the moment, the datetime that is passed in, is reached. For a different kind of trigger, say an S3 trigger, the condition might be different: did my files arrive in the S3 bucket or not? It will poke and then it will sleep. Here as well, it checks whether that time has arrived, or else it just sleeps, and that logic lives in the run method. Implementation-wise, you just need three methods on the trigger: the initialization method, the __init__ method, which is the standard init, nothing special; the serialize method; and the run method. The serialize method defines what information should be stored in the database. When the worker suspends the task and stores some state in the database, this is what it stores. It's saying: hey, I'm going to suspend my execution, so store something in the database that the triggerer can use to recreate a trigger object. That's why there is parity between the initialization method and the serialize method. And then the main logic is in the run method, which is a coroutine, and all it has to do is yield a TriggerEvent when it fires.
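Here is a simplified sketch of what such a trigger might look like, loosely modeled on the date-and-time trigger described above; Airflow ships a more robust version, the classpath returned by serialize is a placeholder, and the moment is assumed to be a timezone-aware datetime.

```python
import asyncio

from airflow.triggers.base import BaseTrigger, TriggerEvent
from airflow.utils import timezone


class DateTimeTrigger(BaseTrigger):
    """Fires once the given moment has been reached or has already passed."""

    def __init__(self, moment):
        super().__init__()
        self.moment = moment  # assumed to be a timezone-aware datetime

    def serialize(self):
        # What gets stored in the metadata DB when the task defers, so that
        # a triggerer can recreate this trigger object later.
        # "path.to.DateTimeTrigger" is a placeholder for the real import path.
        return ("path.to.DateTimeTrigger", {"moment": self.moment})

    async def run(self):
        # Poll-and-sleep loop; asyncio.sleep (not time.sleep) lets thousands
        # of triggers like this one share a single triggerer process.
        while True:
            if timezone.utcnow() >= self.moment:
                yield TriggerEvent(self.moment)
                return
            await asyncio.sleep(1)
```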
In this example, it has an infinite loop that just waits and sleeps until the moment, the datetime, has been reached or has already passed. Then the trigger yields, and that event is stored in the database. You can see that it doesn't use the standard sleep; instead it uses asyncio's sleep with the await keyword. This is asynchronous code: it tells Python to suspend this coroutine and run other coroutines in the meanwhile, and as soon as it is free again it comes back to this coroutine. Python knows how to do that; that's asyncio's job, and we are just using that concept here.

On the operator side, we have a new method called defer, self.defer, which you call in the execute method of the operator. If you don't know how an operator runs: it simply runs whatever is in its execute method. The defer method, self.defer, is called inside execute, and it tells Airflow, and its worker basically, to suspend the task's execution from the worker and instead run it on the triggerer. You need to pass two things when deferring (there's a small sketch of this pattern below). The first is the trigger that needs to run. The second is the method that should run when execution comes back to the worker, once the trigger fires on the triggerer; that is defined by the method name in the second parameter. For example, here we just pass execute_complete, and we don't have anything in the execute_complete method right now. But if you want to show some output, some logs, what you can do is go back to your trigger: where you yield your TriggerEvent, instead of passing just the moment as we currently do, you can pass a response saying, hey, the task succeeded, and here is what the API response was. Then in the execute_complete method, which also takes an event argument, that event argument is populated by the trigger itself, so in your task you can just do self.log.info(event) and show the event message. This is the way for you to communicate what happened on the triggerer side back to the task, because the task logs on the UI will not contain the trigger logs; I'll talk about that in a second as well.

You might now tell me: this all looks really good, but are there any readily available async operators, or do I need to write one for each of them? The answer is a big, big, big yes, that's why you see a lot of GIFs in the background over here. We have shipped some of them in the Airflow repo itself and in the Airflow providers, and then here at Astronomer we have built a few more, not a few, I'll say tons of asynchronous operators, 50-plus of them, for the community to use. Let me talk about Astronomer Providers for a second. These operators are part of the Astronomer Providers repo, where we have built them, written example DAGs, tested them, and added documentation, and we have maintained and are going to keep maintaining them, along with the example DAGs, integration tests, and everything else, for each of them.
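To tie that together with the trigger example, here is a rough sketch of the operator side; the operator class, the trigger import path, and the argument values are illustrative, not from the slides.

```python
from airflow.models.baseoperator import BaseOperator

# DateTimeTrigger is the sketch from the earlier example;
# "path.to.triggers" is an illustrative import path, not a real module.
from path.to.triggers import DateTimeTrigger


class WaitUntilOperator(BaseOperator):
    """Illustrative operator that defers until a given moment has passed."""

    def __init__(self, moment, **kwargs):
        super().__init__(**kwargs)
        self.moment = moment

    def execute(self, context):
        # Suspend this task from the worker and hand the waiting over to a triggerer.
        self.defer(
            trigger=DateTimeTrigger(moment=self.moment),
            method_name="execute_complete",  # method to run once the trigger fires
        )

    def execute_complete(self, context, event=None):
        # `event` is whatever the trigger passed to TriggerEvent when it fired.
        # Logging it here is how trigger-side information shows up in the task logs.
        self.log.info("Trigger fired with event: %s", event)
        return event
```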
You can install it with pip install astronomer-providers, and then, similar to Airflow extras, if you want to use the Snowflake provider you can add snowflake as an extra, and it will install the Snowflake dependencies needed to run the asynchronous operators. This library is Apache 2.0 licensed and completely open source, and we have built it for the entire Airflow community, not just for Astronomer customers. So if you have any feature requests, you can just create an issue over there and we can take care of it, or if you find something and want to help us with it, please feel free to create PRs for it. Most of the operators we have built are drop-in replacements for the synchronous versions, so the only thing you need to do is replace the import line. A good number of these operators also have OpenLineage support, so if you are already using OpenLineage with Airflow you don't need to do anything extra; it will just work out of the box. The operators that are available cover all the major cloud providers, AWS, Google Cloud, Microsoft Azure, the major data warehouses, Databricks, Snowflake, Kubernetes, Apache Livy, Apache Hive, and the core providers, HTTP and filesystem.

This slide shows an example usage of one of the asynchronous sensors, the S3 key sensor, where you only have to replace the import line; the rest of the code stays the same. It's a drop-in replacement: if you are already using the S3 key sensor, you can just change this import line. The import path follows a similar structure to Airflow's: from astronomer.providers.amazon.aws.sensors.s3 import S3KeySensorAsync as S3KeySensor, and that's it, everything else is the same.

But if you don't find the asynchronous operator that you need, it's not already available in Astronomer Providers or Airflow, and you want to build it yourself, it is not that difficult. That's why I want to give you a list of caveats for building an asynchronous operator yourself. So this is what you need to be aware of. By the way, deferrable operators and async operators are the same thing; the terms are used interchangeably, even in the Airflow docs, as well as in this presentation. You will see greater benefits from a deferrable operator if the time spent polling or waiting idle is on the higher side. In benchmarks that we did at Astronomer, when the idle time the synchronous task spent on the worker, just sleeping, was more than 10 minutes, the reduction in resource usage from using the asynchronous operator was around 90%. My point is: the larger the wait time, the greater the benefit of using an asynchronous operator. On the flip side, if the wait time is small, less than 20 seconds, then the overhead of suspending the task from the worker, spawning it on the triggerer until it fires, and going back to the worker is just not worth it; 20 seconds is too little time to justify that overhead. So my recommendation is to only create an asynchronous operator if the wait time is at least 20 to 30 seconds. For example, I won't create an async operator just to create an empty table. It's a simple operation: I'll just submit it, the table is created, and there is no benefit to using an asynchronous operator there.
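Putting the install step and the import swap together, a minimal sketch; the package, extra, and class names are as stated above, while the sensor arguments and the bucket key are illustrative.

```python
# Install the package, optionally with an extra, e.g. snowflake, as mentioned above:
#   pip install astronomer-providers
#   pip install 'astronomer-providers[snowflake]'

# Then swap only the import line; the rest of the DAG stays the same.
# Before:
#   from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
# After:
from astronomer.providers.amazon.aws.sensors.s3 import S3KeySensorAsync as S3KeySensor

# (inside a DAG definition; the bucket key and connection id are illustrative)
wait_for_file = S3KeySensor(
    task_id="wait_for_file",
    bucket_key="s3://my-bucket/path/to/file.csv",
    aws_conn_id="aws_default",
)
```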
So make sure that building an asynchronous operator will actually be beneficial; only then do it. Be mindful of this. The second caveat is that not everything can be deferred. To suspend a task from the worker to the triggerer, you need a unique ID that the trigger can use to poll the external system. In the earlier example where we were talking about Spark and Dataproc, we get a job ID from Google's Dataproc, and we know how to poll for that job ID using an API. If the external system has no reliable way of giving you a unique ID, there is no way we can create a deferrable operator with enough confidence. A good example is Postgres: Postgres does not give us a unique query ID. If I submit a query to Postgres, it does not give me a unique query ID with which to track whether that query has completed or not, so I could not reliably build a deferrable Postgres operator. On the other hand, I was able to do that for Snowflake, because if you submit a query to Snowflake, it gives you a unique query ID that I can use to poll for the query. So not everything can be deferred. And lastly, the trigger logs, and I think I have mentioned this at least a couple of times already, are not visible to users on the webserver. Only the task logs from the worker are present on the UI, on the webserver. So make sure you pass enough information when the trigger fires, in that TriggerEvent where you yield: pass an object with enough information that you can use it in your execute_complete method and show it there, because that will be logged on the worker in the task logs, which are visible in the UI. Again, trigger logs are not visible in the webserver UI.

And that's it. Thank you very much. Do let me know if you have any feedback or questions once you start using deferrable operators. Thank you.