Thank you for coming to my talk. It's going to be about Airflow and workflows, but first just a few words about me. I'm Michał Kaczyński. I work with Python, but also with JavaScript and with Linux. I also write a blog, which some of you may know, and I'm currently a tech lead at Intel and a consultant at a company called Atari.

All right, so let's talk about workflows. How many of you know what I mean when I say workflow? Raise your hands. Okay, that's about four people in the room. It's a very vague term, but also a popular one, so it can be confusing when I just say I'm going to talk about workflows. Let me define it a little to narrow down what I mean. When I talk about workflows, I mean a sequence of tasks that are started on a schedule, or perhaps triggered by an event happening somewhere, and which carry out some work for us. This is frequently used with data processing pipelines and other jobs in the big data field.

A typical workflow could look something like this: some data arrives from a data source, and I download it for processing. I send it off for processing on a system somewhere else. Then I have to monitor whether the processing completed successfully, and when it's done, I get the results back, generate a report, and send that report out by email to some people. That's a very typical workflow example, but the workflow concept is so generic that there are examples almost everywhere. Many kinds of ETL jobs can be defined as workflows. Data warehousing is another place where you would use a workflow. You could use a workflow when you're doing A/B testing, to handle some of the automatic steps for you. Anomaly detection is another area where workflows are used. Training the recommender systems that were presented in the previous talk is probably done with some workflow system. And orchestrating automated testing: this is actually what we use Airflow for at Intel, which is why I marked it on the slide. Another example, from the bioinformatics field: you could process a genome every time a new genome file is published somewhere. All of these jobs can be handled by a workflow.

Because of this, a whole slew of new workflow managers has emerged in recent years, and the five listed on the screen are just a small subset of the ones available; these are the better-known ones. But today I will be speaking about Airflow. Airflow, or Apache Airflow, is an open source project written entirely in Python, using some well-known Python open source technologies itself: it's based on Flask and uses Celery. It was originally developed by Airbnb, but it has grown very quickly and extensively over the last couple of years. It currently has almost 300 contributors, 4,000 commits, and many, many stars on GitHub, and it's used by hundreds of companies. We're using it at Intel, Airbnb I assume still uses it, and Yahoo, PayPal, and many others use it as well.

Airflow provides you with three things. It provides a framework for writing your own workflows. It provides a scheduler and a quite scalable executor for running the workflow tasks. And it provides a web UI for monitoring workflows as they run and for viewing logs.
In this talk I will focus primarily on that first point: the framework you can use in Airflow to define your own workflows and tasks. I won't say very much about the executor or the scheduler, but there was already a good talk on that this morning by Federico Mariani, who was also speaking about Airflow, so if you missed it, I'm sure you can find it on YouTube later.

All right. Before I start showing you code examples and other things, I'll take a minute to show you Airflow itself and give you a quick demo. When you set up and run Airflow on your computer, you get this web interface. It lists all the workflows you have defined in a table, and if you click on the little play symbol right there, you can start a workflow manually, just like that. Then you can go in and take a look at your workflow. This one is called Hello World, and you can see that it's currently being executed. It's running: this task has already managed to complete, and this one is going to be scheduled in a second. You can see the whole history of your workflow runs, and each task gets an entry in this table. You can click on it and view the logs of that particular task; if any errors occurred, the logs would be here. You can also click on the graph view to see another view of the same workflow, from which you can view its logs as well. My Hello World example returns "Hello world", so it works. That's the UI, and it's actually very easy to get to this point: installing Airflow and setting it up is quite simple. I'll talk more about the code needed to write workflows in a moment.

But before I do that, I want to talk about what actually flows in a workflow, and why a workflow is called a workflow. Every task in our workflow makes decisions, and those decisions are based on the input to the workflow run that was started, and on the output of upstream tasks. So all information flows from upstream to downstream. It's kind of like a river, and I want you to think of a river for a minute. Like a river, a workflow begins somewhere; it has a source. It may have many tributaries which join together to form the river. It also ends up somewhere, like a river flowing into the sea, or it can split into many final branches, like a river delta. A workflow can also have branches, and okay, this is where the analogy breaks down a little, because rivers don't usually do that, but workflows do: they can have many branches which split off from the main branch of the workflow's logic and then join back together to form the final result. So this isn't really a river. It's a graph, a directed acyclic graph, where information always flows from upstream to downstream.

You can use that very creatively when you're designing your workflow, because if you put some information into your workflow at any point, it's like dropping a message in a bottle into the river at that point: it will flow down and pass every point downstream. I suppose you would have to drop many bottles into a real river if you wanted to reach every point, but the idea is that you can put information in upstream and it will flow downstream. If I put some information in at point B, it will be available to all points in the graph downstream of that. If I put some information in at point D, the same thing happens.
Finally, at the end point where all the branches combine, I get all the information and can generate my report, or do whatever I need to do with it. So if you write your workflows with this in mind, you can make them quite modular and have downstream tasks use information from sources further upstream.

Okay, that's enough about rivers and the magic of graphs. Let's get to Airflow and how Airflow works with this. Airflow uses the concept of a directed acyclic graph, a DAG, for all workflow definitions, and that allows you to define the logic of your workflow as the shape of the graph. This is very easily done. What's on the slide is actually a complete code example of the Hello World workflow I was showing at the beginning, aside from a few missing import statements, so let me walk you through those couple of lines; it's very simple. First of all, there's some Python function that I want the workflow to execute; this one just returns "Hello world". Then I define the DAG by specifying a couple of parameters, and using that DAG as a context manager, I define a couple of tasks by instantiating operators. The first one is called DummyOperator, the second PythonOperator, and instantiating them creates tasks. To combine these tasks into a graph, I use the bit shift operator, which has been overridden to allow joining tasks together. This method of defining graphs is very quick and easy, and once you get used to it, it lets you create graphs as complex as you need. Moreover, since this is all plain Python code, you can use any looping logic you want to define more complex graphs.

The next Airflow concept I want to talk about is the operator. This is how you define the actions of a single task. An operator is essentially a Python class with an execute method, and that's all you have to create to have a very robust entry in your graph and in your workflow, because the task will automatically be retried if it fails and can be repeated until it succeeds. For that reason, each of these functions should be idempotent, so that running it multiple times won't have unintended consequences. An example really is that simple. In fact, I made the one on the slide slightly more complex than it needs to be, because all it has to be is a class with an execute method, but I added one extra parameter to show that you can also parameterize your tasks when you put them into the final DAG, by passing parameters through the init function.

Another concept Airflow uses is called a sensor. Sensors are long-running tasks, which is very useful for monitoring purposes. If you have some data processing job running somewhere, you may want to check on it periodically to see if it has finished, and Airflow lets you do this very simply: you define a sensor class with a poke method, and the poke method will be called repeatedly until it returns true. My example is slightly silly: the sensor just checks whether the current minute is divisible by three. If it isn't, it returns false, which means the method will be called again after a certain interval, I think one minute by default, and it keeps being called until it returns true. When we finally reach a minute that is divisible by three, poke returns true and the sensor exits.
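The slide code itself is not reproduced here, but based on the description, a minimal sketch of the Hello World DAG might look like this (import paths follow the Airflow 1.x layout; the task IDs and schedule are illustrative, not taken from the talk):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator
    from airflow.operators.python_operator import PythonOperator


    def print_hello():
        # The Python function the workflow will execute.
        return 'Hello world!'


    # The DAG is defined by a handful of parameters and used as a context
    # manager; every operator instantiated inside it becomes a task.
    with DAG('hello_world',
             schedule_interval='@daily',
             start_date=datetime(2017, 1, 1)) as dag:

        dummy = DummyOperator(task_id='dummy_task')
        hello = PythonOperator(task_id='hello_task', python_callable=print_hello)

        # The overridden bit shift operator joins the tasks into a graph:
        # dummy_task runs first, hello_task runs downstream of it.
        dummy >> hello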
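A sketch of the operator pattern described above, again assuming Airflow 1.x; the class name and its greeting parameter are invented for illustration:

    import logging

    from airflow.models import BaseOperator
    from airflow.utils.decorators import apply_defaults


    class GreetingOperator(BaseOperator):
        """An operator is essentially a class with an execute() method."""

        @apply_defaults
        def __init__(self, greeting='Hello', *args, **kwargs):
            # Parameters passed here when the task is added to a DAG let you
            # configure the same operator differently in different workflows.
            super(GreetingOperator, self).__init__(*args, **kwargs)
            self.greeting = greeting

        def execute(self, context):
            # Keep this idempotent: Airflow may retry the task after a failure.
            logging.info('%s from task %s', self.greeting, self.task_id)
            return self.greeting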
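And a sketch of the minute-divisible-by-three sensor; the import path is the Airflow 1.x one (newer versions keep BaseSensorOperator under airflow.sensors), and the class name is made up:

    from datetime import datetime

    from airflow.operators.sensors import BaseSensorOperator


    class MinuteSensor(BaseSensorOperator):
        """A long-running task: poke() is called repeatedly until it returns True."""

        def poke(self, context):
            minute = datetime.now().minute
            if minute % 3 != 0:
                # Not done yet: Airflow will call poke() again after
                # poke_interval, which defaults to 60 seconds.
                return False
            # The condition is met, so the sensor task finishes successfully.
            return True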
Another very important concept in Airflow is XCom, short for cross-communication. It's a means of communicating between tasks, and it's really just a simple way to save things in a database and retrieve them later. Because the messages you pass are saved in the database as pickled objects, it's best suited to small pieces of data, like object IDs, rather than whole objects, but used that way it works very well. It's also very easy to use. In your operator's execute function, you have a parameter called context, and if you retrieve the task instance, the currently running task instance, from that execution context, you can call its xcom_push function to pass some information into XCom. Then, in another task downstream of that one, you can call xcom_pull to retrieve the information and use it. You can also do a trick to scan all upstream tasks, using something like the code example on the slide, which has three lines in the middle where I get all the upstream tasks from the graph, call xcom_pull with the IDs of all those upstream tasks, and query them all for a specific piece of information. In this case I get back an array of all the database IDs defined upstream, for example.

What you can also do when defining your workflows with Airflow is create reusable operators, and this is what makes Airflow workflows very modular. If you use loosely coupled functions as your operator functions, meaning only a few necessary parameters are passed in via XCom and most other parameters are optional with sane defaults, then you can put an operator like that, a task like that, into many different types of workflows. In this example I have an operator, the pink node called Nexus, which collects information from a lot of upstream tasks and combines it somehow. But it can also be used in a different place, in a different graph, where it doesn't get all the same information from upstream, yet it knows how to behave well in that context too, and so it plays a slightly different role in another workflow. This proved to be a very powerful technique for us in our test orchestration with Airflow, because we define blocks of code which fit in many different places, and we can combine them into very many different workflows by reusing the same components in different contexts. If you pay attention to these details, you will be able to do the same thing.

So let's look back at the typical workflow we started with. The tasks that make up the workflow are defined in Airflow through operators. The one used for monitoring the processing is a long-running sensor, and all information that passes from upstream to downstream tasks goes through the XCom functionality.

Okay, there are some more interesting things you can do as well. For example, if you want to follow a certain branch of the graph and skip the others, you can use an operator called the branch operator, or BranchPythonOperator. The example on the slide is very simple: you have a graph with three tasks, one upstream and two downstream, and the upstream branching task decides which branch to follow simply by returning the ID of the downstream task that should be executed. All the others will be skipped.
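A minimal sketch of both sides of that XCom exchange, with made-up task IDs and keys; the upstream-scanning trick uses the operator's upstream_list, roughly as described above:

    from airflow.models import BaseOperator


    class SaveRecordOperator(BaseOperator):
        """Pushes a small value (an ID, not a whole object) into XCom."""

        def execute(self, context):
            record_id = 42  # imagine this is the ID of something the task created
            context['task_instance'].xcom_push(key='record_id', value=record_id)


    class ReportOperator(BaseOperator):
        """Pulls values pushed by tasks upstream of this one."""

        def execute(self, context):
            task_instance = context['task_instance']

            # Pull from one specific upstream task by its ID...
            one_id = task_instance.xcom_pull(task_ids='save_record', key='record_id')

            # ...or scan every task directly upstream of this one and collect
            # whatever each of them pushed under the same key.
            upstream_ids = [task.task_id for task in context['task'].upstream_list]
            all_ids = task_instance.xcom_pull(task_ids=upstream_ids, key='record_id')

            return one_id, list(all_ids)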
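A sketch of that three-task branching graph; the branching condition, DAG name, and task IDs are invented for the example (provide_context=True is the Airflow 1.x way of passing the context to the callable):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator
    from airflow.operators.python_operator import BranchPythonOperator


    def choose_branch(**context):
        # Return the task_id of the downstream task that should run;
        # every other downstream branch will be skipped.
        if context['execution_date'].minute % 2 == 0:
            return 'even_branch'
        return 'odd_branch'


    with DAG('branching_example',
             schedule_interval='@daily',
             start_date=datetime(2017, 1, 1)) as dag:

        branching = BranchPythonOperator(task_id='branching',
                                         python_callable=choose_branch,
                                         provide_context=True)
        even_branch = DummyOperator(task_id='even_branch')
        odd_branch = DummyOperator(task_id='odd_branch')

        branching >> even_branch
        branching >> odd_branch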
Another way to skip tasks, and thereby perhaps skip entire branches of your workflow that you don't want to execute, is to use a special kind of Airflow exception called AirflowSkipException. This exception forces that particular task to be skipped, whereas all other types of exceptions, if they're not caught, cause the task to be retried, and if the retries ultimately don't work out, the task fails. The skip exception is like putting a dam in the river: you're stopping the flow of the workflow downstream. But you can actually control whether you're really stopping execution by deciding which trigger rule a particular task has. By default, every task requires all of its upstream tasks to be successful, but you can change that to one of the other options listed on the slide. The one I find particularly useful is all_done, which means the downstream task will execute whether its upstream tasks succeeded or failed. If you write your operators so that they know how to behave even when an upstream task failed or was skipped, then execution can continue downstream through your workflow. So all_done is like opening the dam from the downstream side of a task.

You can do a lot of other very useful things with Airflow that I won't go into in detail. For example, you can run bash commands as your tasks. To execute bash commands on a worker, you can use the BashOperator, which lets you pass in a bash script that is actually wrapped in a Jinja template. So you're really running a bash script generated by a Jinja template: the template is rendered first, and then the resulting bash command is executed on the worker. I guess an example is worth more than a lot of words.

Airflow also allows you to write plugins to extend it, and writing plugins is very simple as well. You just create a subclass of AirflowPlugin and put it in the plugins directory, and then you can define a whole list of things that Airflow uses and make them available to your instance of Airflow. Operators we already talked about, but you can also define menu links for the web interface I was showing at the beginning. You can create whole admin views, because it's based on Flask-Admin, so you can add views to the administrative interface, or even plug in entire Flask blueprints. It's actually very extensible, so I'm sure it will be useful for many cases you might have.
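A sketch of the dam-and-trigger-rule idea described above; the operator, its condition, and the task IDs are hypothetical:

    from datetime import datetime

    from airflow import DAG
    from airflow.exceptions import AirflowSkipException
    from airflow.models import BaseOperator
    from airflow.operators.dummy_operator import DummyOperator


    class MaybeSkipOperator(BaseOperator):
        """Skips itself (and, under default trigger rules, everything downstream)."""

        def execute(self, context):
            new_files = []  # imagine this came from checking some data source
            if not new_files:
                # Unlike other exceptions, which lead to retries and eventually
                # a failure, this marks the task as skipped.
                raise AirflowSkipException('Nothing to process today')
            return new_files


    with DAG('skip_example', schedule_interval='@daily',
             start_date=datetime(2017, 1, 1)) as dag:

        maybe_skip = MaybeSkipOperator(task_id='maybe_skip')

        # trigger_rule='all_done' opens the dam from downstream: this task runs
        # once all upstream tasks have finished, whether they succeeded, failed
        # or were skipped (the default rule is 'all_success').
        wrap_up = DummyOperator(task_id='wrap_up', trigger_rule='all_done')

        maybe_skip >> wrap_up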
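Here is roughly what the templated BashOperator might look like; {{ ds }} is Airflow's built-in execution-date template variable, while the command and paths are invented:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    with DAG('bash_example', schedule_interval='@daily',
             start_date=datetime(2017, 1, 1)) as dag:

        # bash_command is a Jinja template: it is rendered first ({{ ds }}
        # becomes the run's execution date), and then the resulting script
        # is executed on the worker.
        archive = BashOperator(
            task_id='archive_data',
            bash_command='mkdir -p /tmp/archive/{{ ds }} && echo "archived {{ ds }}"',
        )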
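And a bare-bones plugin skeleton along the lines just described; the plugin name is made up, and each list would hold your own classes:

    from airflow.plugins_manager import AirflowPlugin


    class MyCompanyPlugin(AirflowPlugin):
        # Dropping a module like this into Airflow's plugins/ directory makes
        # everything listed below available to your Airflow instance.
        name = 'my_company_plugin'

        operators = []         # custom operator classes
        menu_links = []        # extra links in the web UI menu
        admin_views = []       # Flask-Admin views for the administrative UI
        flask_blueprints = []  # entire Flask blueprints to plug in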
All right, that's it from me. There's a tutorial available on my blog, so if you want to get started quickly with trying Airflow, go there. Thank you very much.

Thank you, Michał. If you have any questions, please raise your hand.

So I wanted to ask, how does Airflow integrate with distributed systems? If you want to distribute your processing, for example you run some method and you actually want to distribute it over multiple servers, how can you then decide, since you trigger the execution of that process with Airflow, that it runs not on your local machine but on a different server somewhere else?

The underlying technology we use for distributing work across different workers is Celery, and Celery gives you a lot of control over where things get executed, so that would be one way. Other than that, I guess it would be more manual than automatic, but we didn't actually have to make these decisions, because all our workers are capable of running the same set of tasks. So I'm not going to give you a definitive answer, but Celery is under there, and going that way would work.

Hi, I have two questions, actually. The first one is: is it resilient? Do you use it in production at the moment? Okay, and do you have any use case for sensors?

Yes, we do, because, as I mentioned, we're using it for orchestrating automated testing, and these tests can run for a long time. So we're checking on the executors, which are not the same as our Airflow workers; the Airflow workers are just the ones that trigger the tests. We check on the executors to see if they've finished running all the tests, and when they're done, we can make some decisions about what to do next. Sensors work quite well in that context.

Great talk, thanks for that. Two questions: do you run Airflow on more than one node, and have you seen Airflow being used for tasks that require manual input?

I'm not sure I heard you correctly.

Okay, so the first question was: do you run Airflow on more than one node? Can it handle a workflow running on more than one component, more than one node, more than one server? And the other question is, have you seen, or have you ever had the experience of, running Airflow for tasks that require manual input, some user input?

The first question actually has a good answer, because we are running a triplicate of servers: three web interface hosts, three worker hosts, and three schedulers, and without our having to do very much, Airflow behaved very well in this setup. It was running in parallel on three different servers, all three services were running together, and they are able to exchange information, so when I click to view logs in the web interface, it pulls the logs from the correct worker; it knows where the logs are. So it works on multiple nodes like that very well. In terms of manual input, to do that we had to create an API with a user interface, and through that API we call Airflow methods for starting workflows with additional input from users. That is not something that comes out of the box, but we were able to extend Airflow by adding some API methods on the administrative side.

Okay, thank you for the talk, it was very interesting. One question that would be very useful for a project of mine: can you actually group operators and reuse the groups?

When you say you want to group them, do you mean grouping them logically, like in a class hierarchy, or grouping them into smaller workflows?

For example, if I always use the same five operators in the same configuration, can I put them into one overarching operator somehow?

There is an operator type that I haven't experimented with, and I'm not sure how well it works, but it's there for this purpose: it's called the SubDagOperator. You create a DAG with a bunch of operators, a graph of the five operators like you were talking about, and then you use that as an operator itself.
So you kind of put the whole graph into another graph, but I haven't used that, so I'm not sure how well it works.

Okay, thank you so much. Thank you, Michał. Please give another round of applause to Michał.