Hi, welcome to this last session in the PyCharm room for today. Our first speaker is Christian Trebing, who is going to talk about getting control of your workflows with Airflow.

So hi, welcome to my talk: getting control of your workflows with Airflow. I'm Christian Trebing and I work as a software developer at Blue Yonder. We also have a booth here, so if you're interested, just drop by later and ask any questions.

Imagine the following scenario, which I know personally from my daily life. You are at a data-driven company. Each night you get data from your customers, and this data wants to be processed; that's how you make money. Processing happens in separate steps: for example, you have to check that the data is valid, you have to book that data, you apply some machine learning steps, and you have to take decisions based on the results of the machine learning. If errors happen, you need to get an overview of what happened, why it happened and when it happened, especially since most of this stuff is running at night and you need to see the next morning what possibly went wrong. And as you might already have guessed, we have a tight time schedule, so processing time does matter.

What options do you have to realize such a scenario? The first thing that comes to mind for most developers is doing it with cron, and we also had many projects where we started with that. It's a great way to start and it works out of the box, but you only have time triggers. You cannot say that this cron job depends on that cron job and should please start afterwards; you can only say "start at some time". For example: at 22:00 book your data, at midnight do the predict run, and at 02:00 do the decide run. Besides that, the error handling is also hard: you always have to search for the correct log files when something goes wrong.

Now, as I said, we have a tight time schedule and we would like to finish earlier. Each of these steps runs roughly one and a half hours, so why not compress that? We could do better: we could start the predict run half an hour before midnight and the decide run at one o'clock. That works most of the time, but sometimes your database is slow, sometimes you have other issues, and then one run takes longer. Maybe the book data run takes ten minutes longer, the data is not there yet, the predict run fails, the decide run fails, and your complete run fails, which is very bad when you discover it the next morning because your customer cannot get the data he wants. That's an issue with cron, so whenever we used the cron mechanism we always had buffers, and as long as the schedule was not too tight that worked fine.

But what about the next step? Our customer sends more data, the processing time gets longer, and we need to find better solutions. Why not write our own tool? It seems so simple: we just have to check that the first run has stopped and that the next run starts. That cannot be that hard, and the start is indeed easy; in multiple projects we did that and it worked for the first steps. But you hit the limits soon: you may have concurrency, with multiple tasks running at once; you need to know which task failed and why; you might not only want time-based triggers but also manual ones; and you might want a UI or an external endpoint. At that point you have to take a decision. Either you accept the limits, and that's fine.
Or your own workflow implementation gets much more complex than you thought initially, and you are stuck.

We were in that situation as well. We wanted to harmonize all these workflow tools we had in our different projects, and we looked at several open source workflow implementations. There are many interesting ones with many different properties. For example, we also looked at Spotify's Luigi, but that was more an HDFS-based tool, which was not in our technology stack, and we looked at several other tools. In the end we decided for Airflow, which is an open source project initiated by Airbnb, hence the name Airflow.

Why did we decide for that? Well, the tool itself is written in Python; we know it and we like it. One thing that is really cool is that the workflows are defined in Python code. They are not sitting in some JSON files or in some database rows; each workflow really is Python code. You can put it in your version control system, you get all the versioning, and that is a very good way of managing workflows. It has most of the features I mentioned, within limitations: you can have a look at the present and the past runs, and it has logging features. It is extensible, which is great: you can write your own extensions in Python code and plug them in without having to modify the open source code, and Airflow detects these plugins automatically (I'll tell more on that later). It is under active development: at the moment it is an Apache incubator project, people are reacting on pull requests, there is lots of traffic, and you can see that it moves forward. It has a nice UI, which I will show you, you can define your own REST interface, and it is relatively lightweight: you have two processes on a server and you need a database to store the state.

How does a workflow look? This is the Python code I talked about. Basically each workflow is a DAG, a directed acyclic graph. You instantiate it and give it some parameters, like when the first run is and what schedule you have; the schedule can be given as a cron expression or as a timedelta. You define your workflow steps as operators (I'll tell more about the operators later). Here we have three steps: we are booking the data, we are predicting, and we take the decision. The connections between the steps you define via set_upstream: you say that before the predict happens, the book data step needs to happen, and before the decide happens, the predict needs to happen (a sketch of this follows below). This should show up as a graphic; it doesn't work here, but okay, let's go on to the more complex stuff.

Maybe you want to have a fan-in/fan-out: we have more data, the predict run takes longer, and we want to parallelize that. Maybe we say we do some prediction for German customers and some prediction for the UK. So I can say that predict Germany and predict UK both depend on the booking of the data, and the decision depends on both of them. It is very easy to describe, and Airflow will give you that graph directly. With that you can build arbitrarily complex workflows. You also have the possibility of decisions and switches, but at least for us we did not need them up to now; most of our workflows are quite linear, with just a few branches.
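A minimal sketch of such a workflow definition might look like this; the task names and bash commands are made up for illustration, and the import paths differ a bit between Airflow versions:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # import path varies with Airflow version

# Instantiate the workflow (DAG): when the first run is and how often it repeats.
# The schedule can also be given as a cron expression string instead of a timedelta.
dag = DAG(
    dag_id="daily_processing",
    start_date=datetime(2016, 7, 17),
    schedule_interval=timedelta(days=1),
)

# Each step is an operator; BashOperator is only a placeholder for the real work here.
book_data = BashOperator(task_id="book_data", bash_command="echo booking data", dag=dag)
predict = BashOperator(task_id="predict", bash_command="echo predicting", dag=dag)
decide = BashOperator(task_id="decide", bash_command="echo deciding", dag=dag)

# Dependencies: book_data runs before predict, predict runs before decide.
predict.set_upstream(book_data)
decide.set_upstream(predict)
```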
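The fan-out/fan-in variant from the talk (predicting Germany and the UK in parallel) only needs a couple of extra upstream declarations. Again a sketch with made-up task ids, reusing dag, book_data and decide from the previous snippet and replacing the single predict task:

```python
# Two parallel prediction tasks instead of one.
predict_de = BashOperator(task_id="predict_de", bash_command="echo predicting DE", dag=dag)
predict_uk = BashOperator(task_id="predict_uk", bash_command="echo predicting UK", dag=dag)

# Fan-out: both prediction tasks depend on the booking step ...
predict_de.set_upstream(book_data)
predict_uk.set_upstream(book_data)

# ... and fan-in: the decision depends on both predictions.
decide.set_upstream(predict_de)
decide.set_upstream(predict_uk)
```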
So how does the nice UI look that I promised you? You have an overview where you see which workflows you have, what their schedule is, and what the most recent statuses are. It is a little bit small to read, but it shows how many tasks have run correctly, how many tasks are currently running, which ones are erroneous and which are currently up for retry.

You can also look at each DAG run explicitly. You see the sequence, color coded, so you can see which step was successful, which is currently running, which is erroneous, and so forth. This one is a run that did not start yet: it is scheduled, but starting has not happened so far.

The tree view shows you an overview of all the runs. Each column is a run day, so for each day you see: these three days went correctly, all green, and the last run currently has an issue in the second step; it is yellow, which means it is up for retry. So you get a nice overview of how it behaved in the past and how it is behaving currently.

Also helpful is the runtime view, with which you can see, for example, performance degradation. Here we have three runs, and the colors are the different tasks. Let's say the blue one is the book data step, this one is the predict step and this one is the decide step; you see one behaved the same over time and the other two changed, which is very useful for seeing which of the steps might have taken longer. You can also see each run as a Gantt chart, showing when each step happened, and you have a log view, which is really useful, where you can output things. Unfortunately it is a little bit small here, but it says that the decide task has started a job in the backend system with job ID 17, in the next iteration it asks what the status is now, and then we see it is finished. So you can follow how each task was processed.

Now, what are the building blocks of your workflows? These are operators, and many operators are already delivered with Airflow. As examples: you can start things on the bash, you can start things with an HTTP request, you can execute statements on databases, you can run Python code directly, or you can send mails. These are just a few examples; more are delivered with Airflow. There are not only operators but also sensors. Sensors are steps in your workflow that wait for things. An HTTP sensor, for example, could repeatedly query a URL and ask whether a job is finished or what its status is, and based on that it will wait or proceed with the workflow. In the same way, an HDFS sensor could check for files on the file system, and an SQL sensor could check for values in the database.
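As a rough illustration of how such building blocks combine, here is a hedged sketch that starts a backend job with the SimpleHttpOperator and then waits for it with an HttpSensor. The connection id, endpoints and response format are assumptions, the dag object is the one from the earlier sketch, and the import paths depend on the Airflow version:

```python
from airflow.operators.http_operator import SimpleHttpOperator  # path varies by Airflow version
from airflow.operators.sensors import HttpSensor

# Start a job on the backend; assumes an HTTP connection named "backend" is configured in Airflow.
start_predict = SimpleHttpOperator(
    task_id="start_predict",
    http_conn_id="backend",
    endpoint="predict",
    method="POST",
    dag=dag,
)

# Poll the backend until it reports the job as finished, then let the workflow proceed.
wait_for_predict = HttpSensor(
    task_id="wait_for_predict",
    http_conn_id="backend",
    endpoint="predict/status",
    response_check=lambda response: "finished" in response.text,
    poke_interval=60,
    dag=dag,
)

wait_for_predict.set_upstream(start_predict)
```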
Many things you can already do with these operators and sensors, but there might be situations when you need more. For us, for example, we had asynchronous processing in our backend systems: we have our Airflow system and we have a backend system, for example the machine learning system for the predictions. We want to start a job there, so we trigger an HTTP request and get back a job ID. Then we let it run for five minutes, half an hour or so, and we constantly ask whether it is finished or not; when it is finished, we can start the next job.

This would already be possible with the standard pieces of Airflow: we could use the SimpleHttpOperator to start it and a sensor, as I described, to wait until it is finished. This works, but it has the disadvantage that you don't see directly how long it took. Remember the runtime view: I would like to see how long my decision took, so I want this decide step to have a certain length, namely the time it took on the backend system. This is possible with a new operator. I won't explain each line in detail; you can find this afterwards as a complete Airflow example plugin in a GitHub repo, where you can check each line. We have an HTTP connection defined, we have an endpoint "decide" that we can trigger and that returns a job ID, and we have a job status endpoint that we can ask, given the job ID, what the status is. So within the execute method we run the POST on "decide" to get back the job ID, then we wait for the job with that job ID, and once the status is finished we are done. Then, within the Airflow database, we know how long this decide step took.

Now, how do you get these operators into your system? As I said, we don't want to modify the Airflow code directly, but we can do this in a Python package. We can say we have a plugin that has some of its own operators and some Flask blueprints, lay that into our file system, and in the Airflow configuration we just say where the plugins are and where the workflow definitions are. On startup, Airflow will detect them automatically. The plugin itself is also defined in Python: you import the Airflow plugin manager, inherit from AirflowPlugin, and in our EuroPython plugin we say it has the three operators I need and also a blueprint.

What is that blueprint about, and why do you need it? We had the requirement that we wanted an endpoint to talk REST-style with our Airflow system, so that we can also say programmatically "I want to trigger this manually" or "is this DAG run finished or not?". This functionality was not in Airflow, but you can write it as a Flask blueprint: you define the endpoints, and they are detected automatically and added to the web server. You will also see this in the example repo. How would such a REST endpoint look? We have the Airflow server running on port 8080, we have defined this endpoint "trigger", and we give it the name of the workflow, which is "daily processing". We get back the name of the workflow and the run ID, which we can use afterwards to ask for the status. So this works fine.
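The plug-in's operator is roughly of this shape. This is not the actual plug-in code: the endpoint names, connection id and JSON fields are assumptions made up for the sketch, and import paths differ between Airflow versions:

```python
import logging
import time

from airflow.hooks.http_hook import HttpHook  # path varies by Airflow version
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class DecideJobOperator(BaseOperator):
    """Start the 'decide' job on the backend and block until it has finished,
    so that the task duration recorded by Airflow matches the backend runtime."""

    @apply_defaults
    def __init__(self, http_conn_id="backend", poke_interval=60, *args, **kwargs):
        super(DecideJobOperator, self).__init__(*args, **kwargs)
        self.http_conn_id = http_conn_id
        self.poke_interval = poke_interval

    def execute(self, context):
        # POST to the 'decide' endpoint; the backend answers with a job id.
        job_id = HttpHook(method="POST", http_conn_id=self.http_conn_id).run("decide").json()["job_id"]

        # Poll the job status endpoint until the backend reports the job as finished.
        status_hook = HttpHook(method="GET", http_conn_id=self.http_conn_id)
        while True:
            status = status_hook.run("jobs/%s/status" % job_id).json()["status"]
            logging.info("Job %s has status %s", job_id, status)
            if status == "finished":
                return
            time.sleep(self.poke_interval)
```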
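Registering the operators and a Flask blueprint then happens through the plug-in mechanism, roughly like this. The class names and the trigger endpoint are illustrative, and the way the DAG run is created here is a simplified assumption, not the code from the demo repo:

```python
from datetime import datetime

from airflow.models import DagBag
from airflow.plugins_manager import AirflowPlugin
from flask import Blueprint, jsonify

# A small REST-style blueprint; the Airflow web server picks it up automatically.
rest_api = Blueprint("rest_api", __name__, url_prefix="/trigger")


@rest_api.route("/<dag_id>", methods=["POST"])
def trigger(dag_id):
    # Create a manually triggered run for the given workflow and return its run id.
    dag = DagBag().get_dag(dag_id)
    execution_date = datetime.now()
    run = dag.create_dagrun(
        run_id="manual__" + execution_date.isoformat(),
        execution_date=execution_date,
        state="running",
        external_trigger=True,
    )
    return jsonify(dag_id=dag_id, run_id=run.run_id)


class EuroPythonPlugin(AirflowPlugin):
    name = "europython_plugin"
    operators = [DecideJobOperator]   # the operator sketched above, normally imported from its module
    flask_blueprints = [rest_api]
```

With the web server on port 8080, POSTing to a URL like http://localhost:8080/trigger/daily_processing would then return the workflow name and run ID, as described in the talk.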
Now, what happens inside of Airflow? It works with two processes, or at least two processes, I should say: a scheduler process that takes care of when each job should run, and a web server that serves the UI and all the additional blueprints. You also need a database, and several databases are supported. At our company we are using Postgres and SQLite. SQLite currently has the restriction that you cannot run tasks in parallel with it, but we use SQLite more for development and testing, so that is fine; for production you can use Postgres, where you don't have that limitation.

You can also choose how you want your tasks to be executed. Most of the time we just use HTTP requests: we trigger a task in the backend system and wait until it is finished, so there is no high workload on the Airflow system itself. We are therefore happy to run tasks within the scheduler process directly, or, when we want multiple tasks in parallel, with subprocesses. But it is also possible, if you trigger your work via bash scripts or similar things and want more power behind the executor nodes, to use Celery, a framework with multiple worker nodes; a connection from Airflow to Celery already exists.

How do we use it? Most of the things have already been mentioned in the meantime. We use the automatic schedules and we have manual triggers. We use one Airflow instance per system we manage; we also discussed whether to have one central company-wide Airflow instance or one instance per system, and for us it was easier to do it per system. As databases we use Postgres and SQLite, and we use the lightweight executors. We are also contributing to Airflow, and it is really good that this works well: the external triggers, so that you can trigger DAG runs manually, were not there a year ago, and we definitely needed them before using Airflow, so we wrote a pull request that was worked on, and with these two pull requests this is now in Airflow. We also needed some functionality for the plugin detection, so we opened a pull request there as well; it is an active communication with the community.

With all these good things about Airflow, there are at least a few challenges I want to make you aware of, because these were things that we, and also the project teams using Airflow at our side, struggled with a little. They have to do with how scheduling is handled and how the start date is interpreted.

For scheduling there are two dates that are important. The first is the start date: when did the processing of this task, of this workflow, start on the server? That is quite easy; it is the server time. But there is also an execution date, which is shown quite prominently in the UI and which sometimes shows strange values. These values are consistent and explainable, but they are not always obvious. The reason is Airflow's history: it was used in ETL scenarios, extract-transform-load, which means they wanted to process daily data that was accumulating over the whole day. Let's say on the 19th of July data came in the whole day, and then you want to process that data for the 19th of July. When can you process it? Only after the 19th, which is the 20th, so let's say today. So today this data processing task runs, and what is the execution date? It is the 19th. It is always one iteration back, because originally they said: this is more of a description, this is the data from the 19th, therefore that is the execution date. That's fine when you know it; then it does not scare you and you don't think the system is doing weird things. It is consistent, but you have to get used to it.
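To make this concrete, here is a hedged sketch (the dates and names are made up) of a daily DAG. Started on the 20th of July with a start date of the 17th, the scheduler will fill up the missing runs (more on that below), and inside a task the templated execution date tells you which day's data a run is meant to process:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # path varies by Airflow version

dag = DAG(
    dag_id="backfill_example",
    start_date=datetime(2016, 7, 17),     # scheduler started on the 20th: runs for the 17th,
    schedule_interval=timedelta(days=1),  # 18th and 19th are filled up automatically
)


def book_data(ds, **kwargs):
    # 'ds' is the execution date as a string; for the run starting on the 20th it is
    # '2016-07-19', i.e. the day whose data is being processed, one interval back.
    print("Processing data for %s" % ds)


book = PythonOperator(
    task_id="book_data",
    python_callable=book_data,
    provide_context=True,
    dag=dag,
)
```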
We have some workflows running on a weekly schedule, which means that when I trigger one now, it gives me as execution date the Monday of the week before. That is also consistent, but you need to get used to it.

Then there is the start date. You may remember from ten minutes ago that we give a start date for each workflow. If the workflow is scheduled automatically and you start the server, it will know that it has to fill up runs. Say the start date we have given is the 17th of July and we start the scheduler today, on the 20th of July: it will detect that some runs are missing and it will fill them in automatically. So it will first trigger the run with execution date of the 17th, then the 18th, and then the regular run for the 19th will be processed at the correct point in time. You need to check whether this is applicable for you. If you really need to process the data for those days, that is fine. If it is more like "I want to trigger something in the backend, and I need to trigger it just once, because that backend job takes care of cleaning everything up", then this is a little bit strange and can lead to issues if you trigger it too often. But you can work around it by giving the correct start time right away: you can determine that in code, you can determine it from a variable, there are several options. I won't discuss them in detail, but this is the thing you should keep in mind when you wonder why this backfill happens. It is possible to handle, but you need to know the concept behind it. If you have further questions, maybe we can discuss them afterwards.

Okay, and that's it from my presentation. Here is the incubator project for Airflow; it has nice documentation. Also very useful is the common pitfalls page in the Airflow wiki, where the behaviour of the execution date is explained in more detail. The plugin I have shown you parts of you can find in our Blue Yonder repo; you can download it, and the README has the steps on how to use it in your Airflow instance. So that's it from my side.

Yes, hello, thanks for the presentation. You showed us the GUI. Is it possible to manage task dependencies in this GUI, or is it just for displaying things that you wrote in the code?

The workflow definitions themselves you do in code, so you can view that code from the GUI, but you have to change it in your code editor.

Okay, because in our firm we have a homemade scheduler with, I think, 100,000 tasks inside. Is Airflow scalable? Do you use this amount of tasks in your system?

No, we don't have that high a task volume in our system.
For us it is more that we have these nightly runs with several tasks each, but not thousands or millions of them. In the documentation from Airbnb the DAGs seem to be much bigger, so it would be worth asking them what the limit is, but we did not reach it up to now.

Hi, I would like to ask about the execution date and the run date. Is it possible to configure it? We have a similar example where you collect data for the last month and you want to run it, for example, 15 days later, or in the opposite direction, where you want to collect data for the next month and run 15 days before that. Can you configure the delay, or maybe even postpone it, if you see: okay, I will run tomorrow, but if I have no data tomorrow, I will retry the day after tomorrow?

Well, first of all, the logic is not configurable; this is in the scheduling code itself. Regarding running two weeks after or two weeks from now, I have no quick answer to that; maybe we can discuss it afterwards. I think you can do many things with the scheduling, because the scheduling only determines when a run starts. You also have the possibility to schedule a run each day and, as the first task of the run, decide whether you really want to run or not. That might be a first iteration, where you implement the more complex logic in your first task. But maybe there are also other ways. Okay, thank you.

Let's have one more question. Did you evaluate other tools when you decided about Airflow, and why did you decide for Airflow?

Yes, for example we looked at Luigi, but that was based on an HDFS stack which we did not run, so it was too heavyweight just to set up a workflow system. We also looked at several other open source implementations, but their main focus was on doing the heavy lifting with execution processes and how those are distributed. Since we had very lightweight processes, but needed more UI features and more possibilities to define our own operators, that was just not their main focus. When you see that a tool is great for these two things but its main focus is something different, then, yeah, it's good to look at alternatives.

That was because, in theory, Jenkins also does a lot of these things with some plugins, and if you already have Jenkins you have to convince your team to use something else. So what could be one thing that you can do with Airflow that otherwise you cannot do?

I mean, Jenkins is great; we also use Jenkins for our integration testing and for scheduling our unit tests, but not for the daily productive runs.

Okay, let's give a big hand to Christian.