So, can you hear me a bit better? Right, welcome everyone to this talk about lessons learned from a migration to Apache Airflow. My name is Radek and I work as chief architect at Skimlinks. At Skimlinks we do commerce content monetization, which I will explain in a moment. In my free time I also work as a trainer at Framework Training, where we run big data related courses. I used to work as a CTO at a couple of companies, including K4G and DataMine Lab, and I have been involved in many big data projects in the past, with companies such as Orange, OpenX and many, many more.

The agenda for this talk: I will start by telling you a bit more about what we do at Skimlinks, what kind of data we process, how we do it with Airflow and why we chose Airflow to process the data. Part one will focus on the basic Airflow concepts, so if you don't know anything about Airflow yet, don't worry: I will cover the components and the features, and I will show you some sample code. In part two I will focus on some of the best practices that we developed over the last year at Skimlinks, mention a few things about deployment, and finish with what I like to call the good, the bad and the ugly.

So, the Skimlinks data pipeline. The longer version of what we do at Skimlinks is that we monetize product links in commerce-related content to earn publishers a share of sales. I like to call it a better version of advertising, because we don't actually show any display advertising on the websites. What we do is give publishers a snippet of JavaScript code that they include in their website; that JavaScript scans all the links to external merchants such as Amazon, eBay and so on, and turns them into affiliate links. If the end user clicks on any of those links and then buys the product, we give a percentage share of that purchase to the publisher. Essentially it is something that is completely invisible from the user's point of view.

To give you some numbers: we work with around 60,000 publishers around the world, including over 50% of the top 100 publisher websites in the US and UK. We also work with almost 50,000 merchants around the world, and we work with them through affiliate networks, so we don't do the tracking of the affiliate links ourselves; we are an extra layer on top of the networks. Last year we processed over 80 billion page impressions and almost half a billion clicks, and we drove around 800 million in e-commerce transactions for publishers around the world. That translates into hundreds of terabytes of data, so it is a pretty massive big data project from that point of view. What we do with this data is process it, aggregate it and present it as customer reports to the publishers. We also do data exports, so we aggregate the data and deliver it to publishers and customers by dropping it into an Amazon S3 bucket, Google Cloud Storage and so on, and we also do some machine learning predictions.

Why Airflow? Around a year ago we decided to make some big changes at Skimlinks. We used to run a Hadoop cluster, and we decided to replace Hadoop with BigQuery, which is a managed SQL data warehouse; for scheduling we used to use Apache Oozie, and we settled on Airflow instead. Why Airflow? Well, Airflow is written entirely in Python.
What that means is, Python happens to be the language of choice for a lot of data scientists, so if you are a data scientist and you want to implement some machine learning, you can do it directly inside Airflow. You can think about Airflow as cron on steroids: it helps you schedule your batch processing. Let's say you want to run some computation every hour or every day; you can do that very easily with Airflow.

It has been a great productivity enhancer for us. We calculated that during the last year, when we migrated from Hadoop to BigQuery and Airflow, we managed to release roughly twice as many features as we did before with Oozie. This is to a large degree thanks to the features that Airflow gives us. First, the modern user interface, which is something you would never consider a productivity enhancer, but it helps a lot: it scans the Python code of your jobs and visualizes all the jobs you implemented in the user interface you can see on the right side. It shows you which jobs are currently running and which have finished, and you have access to real-time logs of every task in every job, so you can see if anything went wrong. It is not just a static view: you can drill into any job or task, and if something went wrong, with two clicks you can retry any task very easily.

On top of that there are features like data profiling. Let's say you notice that your jobs are getting slower and slower and you don't really know why: you can go into the section of the UI called data profiling, and it will show you which parts of the jobs are the slowest, so you can immediately see where you need to spend effort to speed things up. There is also the command line interface: if you are someone who doesn't really like user interfaces, you can run a lot of things directly from the command line. And last but not least, the entire Airflow stack is horizontally scalable. At Skimlinks we run it on Kubernetes using the so-called Celery executor; I will come back to this later. What that means is that you can have multiple workers, pretty much as many as you want, which run all your computations in parallel.

As I already said, it is written in Python, and it is a code-first scheduler: you define everything inside your Python code, things like when the job should start (the start date) and how frequently it should run, whether every hour or every day; all of that goes into the code. And last but not least, it has a great open source community, so chances are that whatever you are trying to do, you will probably find an existing implementation inside the Airflow repository.
So here is what we arrived at after one year of migrating from Hadoop to BigQuery; this is roughly the architecture. On the left side we have the input data. The page impressions are the heaviest part, with terabytes of data: we push all the page impressions to Google Pub/Sub, write them into Google Cloud Storage, and use Dataflow (whose open source version is Apache Beam) to read the data from Google Cloud Storage and import it into BigQuery. The clicks, in our case, are imported in real time: we have a system that reads them from Pub/Sub in small batches of a few hundred clicks and inserts them into BigQuery in real time.

One important thing to note here is that Airflow will help you with batch processing, but not with real-time processing. Airflow is by definition a scheduler that runs a job for you at predefined intervals, while for real-time processing you need a process that keeps running in the background all the time. Apart from this real-time part, everything else here is orchestrated by Airflow. We also import some data from MySQL, such as customer metadata, commissions data and things like that. All of it ends up in BigQuery, but you could very easily do the same with any other SQL database; we just opted for BigQuery because of the size of our data.

Monitoring is also very important: you always want very good checks on the quality of the data, so whenever something goes wrong we get notifications. We track the important time series in InfluxDB and visualize them with Grafana. Then on the right side we export the data to three places: an internal data warehouse hosted in MySQL, the reporting for the end customer, which is actually done in Elasticsearch (and works surprisingly well for us), and, as I mentioned earlier, data exports to Google Cloud Storage and Amazon S3.

On top of this pipeline we also use Apache Spark for some heavy machine learning processing, and here Airflow gives you really great flexibility: there are always multiple ways to do the same thing, and the same is true for Apache Spark. We have three ways in which we can run Apache Spark. Probably the best way is to use the PySpark module together with Airflow's PythonOperator: inside the PythonOperator you can run any Python code you want, and there you use PySpark to implement the Spark computations, so Airflow orchestrates Spark from within the Python code. An alternative is to use spark-submit: you keep all your code inside the Apache Spark application and just submit the Spark jobs from Airflow, either with the dedicated SparkSubmitOperator or, if you prefer, by calling spark-submit yourself from a BashOperator. A rough sketch of these options follows.
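To make those three options a bit more concrete, here is a rough, hedged sketch rather than our production code; it assumes Airflow 1.10-era import paths (including the contrib SparkSubmitOperator), that PySpark is installed on the workers, and the job paths, bucket names and connection ids are all made up:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.bash_operator import BashOperator
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

dag = DAG("spark_example", start_date=datetime(2019, 1, 1), schedule_interval="@daily")

def run_pyspark_job(**context):
    # Option 1: drive Spark directly from Python with the PySpark module
    # (the bucket paths and the aggregation are placeholders).
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("feature_extraction").getOrCreate()
    clicks = spark.read.parquet("gs://example-bucket/clicks/")
    clicks.groupBy("publisher_id").count() \
          .write.mode("overwrite").parquet("gs://example-bucket/click_counts/")
    spark.stop()

extract_features = PythonOperator(
    task_id="extract_features",
    python_callable=run_pyspark_job,
    provide_context=True,
    dag=dag,
)

# Option 2: keep the logic in a Spark application and submit it from Airflow.
train_model = SparkSubmitOperator(
    task_id="train_model",
    application="/jobs/train_model.py",   # hypothetical PySpark application
    conn_id="spark_default",
    dag=dag,
)

# Option 3: the same submission done "by hand" from a BashOperator.
train_model_bash = BashOperator(
    task_id="train_model_bash",
    bash_command="spark-submit --master yarn /jobs/train_model.py",
    dag=dag,
)
```

In practice you would pick just one of the three; we tend to prefer the PythonOperator plus PySpark route because everything stays in one Python codebase.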
The reality, though, is that most of the effort in machine learning projects is usually spent on data engineering: things like cleaning the input data, ETL (extraction, transformation and loading), preparing your features, running a series of jobs and eventually productionizing the entire data pipeline. That is something Airflow is really going to help you with. It won't help you much with the machine learning itself, which you implement yourself in Python, but it will help a lot with productionizing all the jobs.

Right, so before we go into the details of how Airflow actually works, I wanted to ask: does anyone here already use Airflow? Hands up. All right, only one person, so it's good that we are covering some basics.

When you work with Airflow you will hear the term DAG all the time. DAG stands for directed acyclic graph, and a DAG is essentially your job, your workflow. A DAG is built from tasks and the dependencies between the tasks. A task is the bit where you process some data or, for example, run an Apache Spark computation, and the dependencies tell Airflow what should happen next once a task finishes. For example, here on the right side we have a sample pipeline from Skimlinks where we first process the clicks. You can notice that the clicks pipeline, the comms pipeline and the pages pipeline, which are all green, don't have any dependencies, so they can all run in parallel. The link activity pipeline has a dependency on the clicks pipeline, which means it can only start once the clicks pipeline finishes. The link revenue pipeline at the bottom left (the names don't really matter, they are just task names) can only start once both the comms pipeline and the link activity pipeline have finished.

You can define much more complex dependencies between tasks: you can have branching operators, you can have joins, where the next task only starts once both upstream tasks have finished, and so on. A minimal sketch of how such dependencies look in code is shown below.
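Here is a minimal sketch of how dependencies like the ones on that slide could be wired up. The task ids loosely follow the slide, but the DummyOperator placeholders, the DAG name and the dates are made up:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG("sample_pipeline", start_date=datetime(2019, 1, 1), schedule_interval="@daily")

# Tasks with no upstream dependencies can all run in parallel.
clicks = DummyOperator(task_id="clicks_pipeline", dag=dag)
comms = DummyOperator(task_id="comms_pipeline", dag=dag)
pages = DummyOperator(task_id="pages_pipeline", dag=dag)

link_activity = DummyOperator(task_id="link_activity_pipeline", dag=dag)
link_revenue = DummyOperator(task_id="link_revenue_pipeline", dag=dag)

# link_activity can only start once clicks has finished ...
clicks >> link_activity
# ... and link_revenue only starts once both comms and link_activity have finished.
[comms, link_activity] >> link_revenue
```

Tasks with no upstream dependencies start as soon as the DAG run begins; everything else waits for its upstream tasks to finish.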
Branching lets you skip some tasks: for example, you can have certain tasks that run on weekdays but are skipped during the weekend, and so on. It is a very simple but very powerful concept.

Some examples of DAGs and jobs: you can create a report by running an SQL query and storing the results in an output file or output table, you can extract the features for a machine learning pipeline, or you can trigger an Apache Spark job. What is very exciting is that Airflow parses your tasks and the dependencies between them, which you implemented in Python, and produces this visualization automatically for you. The visualization is interactive: you can click on any of those tasks, drill into them, see the logs for that specific task, and rerun the task together with any of its dependencies. Let's say that for whatever reason the pages pipeline on the right side failed and we want to rerun it; if we select that we also want to rerun the dependencies, Airflow will automatically rerun the pages pipeline, the page activity pipeline and all the dependencies below it as well. That is very useful, because problems do happen from time to time in a production system and you want to be able to fix such things.

An operator: you can think of an operator as a class that defines what is going to be done inside your task. There is a choice of operators: built-in operators, contrib operators, or your own custom operators. An example would be the PostgresOperator, which allows you to execute an SQL query of your choice against a Postgres database, or the BashOperator, where you can run any custom bash command. A task is then an instance of an operator (or of a sensor, which I will come back to), so from that point of view you can think of a task as an instance of the class, an object; a task is a node in your DAG. When you create the instance of the operator you also provide the parameters for that operator, so for example with the PostgresOperator the parameters are the SQL query you want to run, plus the connection details of the database you want to run it against.

We can also use the very powerful Jinja templating system inside Airflow. It allows you, for example with the BashOperator, to put some parameters inside the Jinja template and then parameterize the command by passing variables into your task. On the right side we have an example of the BigQueryOperator: we create a merge task, which is an instance of the BigQueryOperator, and we specify the name of the task, the SQL query we want to run against BigQuery, the connection ID (you might have a production database, a staging database and so on), as well as some additional parameters. The interesting parameter I wanted to point out here is the number of retries you want Airflow to perform before marking the task as failed; a hedged reconstruction of that operator is shown below.
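This is roughly what that slide looked like, assuming the contrib BigQueryOperator from Airflow 1.10 (where the query is passed via `sql`); the DAG, table names, connection id and query here are placeholders rather than our real production code:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

dag = DAG("reporting", start_date=datetime(2019, 1, 1), schedule_interval="@daily")

# "merge" task: run a SQL statement against BigQuery, retrying on transient failures.
merge = BigQueryOperator(
    task_id="merge_transactions",
    sql="""
        SELECT publisher_id, SUM(order_value) AS revenue
        FROM transactions.raw
        WHERE date = '{{ ds }}'
        GROUP BY publisher_id
    """,
    destination_dataset_table="reporting.daily_revenue",
    write_disposition="WRITE_TRUNCATE",      # overwrite, so re-runs don't duplicate data
    bigquery_conn_id="bigquery_production",  # swap for a staging connection in staging
    use_legacy_sql=False,
    retries=3,                               # retried before the task is marked as failed
    retry_delay=timedelta(minutes=5),
    dag=dag,
)
```

The retries and retry_delay arguments come from the base operator, so the same pattern works with any operator, not just the BigQuery one.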
So, for example, if Airflow runs this task for the first time and the task fails for whatever reason, maybe because there was a problem connecting to the database, Airflow will automatically retry the task, and you can define how many retries you want. This is a very powerful feature because it allows Airflow to heal itself. Very often we come to the office in the morning and see some emails saying there was a problem with one of the pipelines, but actually everything went fine in the end because of the retries. There are quite a lot of problems that just happen randomly from time to time.

Some more advanced features that I'm not going to cover in detail here, but that you can use with Airflow. Hooks give you an interface to external databases and platforms; for example there is a MySQL hook, a Postgres hook, a BigQuery hook and so on. Connections are stored in the Airflow metadata database, and you can modify them through the Airflow user interface. That is a very useful feature, because if you have two environments, say production and staging, you don't have to bake the connection into your code; you can just store it via the Airflow UI, and it will be kept in the Airflow database. Then we also have variables: you can define variables in the Airflow metadata database, which you can also manage through the user interface. Same story here: you don't have to hard-code values in your code. For example, you might want to use a different Amazon S3 path in production and in staging; instead of putting it into the code you can modify it directly from the Airflow UI.

XComs allow you to pass parameters from one task to another, so they are a means of communication between tasks. One task can push some information to the next one, such as "I processed the data for the last three days", and the next task needs to know how many days it has to process. Sensors allow you to start tasks only when certain criteria are true. As I mentioned earlier, Airflow lets you schedule your batches, so you can tell it to run some processing once a day at 9 a.m.; but you might want to add extra criteria, such as: run this at 9 a.m., but only if this specific file already exists in this specific folder, for example in HDFS. You might be waiting for some external dependency that Airflow has no control over; with a sensor you can check whether the criterion is already true and only start executing your task once it is. A small sketch of variables, a sensor and XComs is shown below.
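Here is a small, hedged sketch of those three features together; the import paths assume Airflow 1.10, the bucket, variable name and values are made up, and the XCom payload is just an illustration:

```python
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.sensors.gcs_sensor import GoogleCloudStorageObjectSensor

dag = DAG("exports", start_date=datetime(2019, 1, 1), schedule_interval="@daily")

# Variables: read an environment-specific value from the metadata database
# instead of hard-coding it in the DAG.
export_bucket = Variable.get("export_bucket", default_var="example-staging-bucket")

# Sensor: only start downstream processing once an external file has appeared.
wait_for_export = GoogleCloudStorageObjectSensor(
    task_id="wait_for_export",
    bucket=export_bucket,
    object="exports/{{ ds }}/clicks.avro",   # templated with the execution date
    poke_interval=300,                        # check every five minutes
    timeout=6 * 60 * 60,                      # give up after six hours
    dag=dag,
)

# XComs: one task pushes a value, the next one pulls it.
def decide_days(**context):
    context["ti"].xcom_push(key="days_to_process", value=3)

def process(**context):
    days = context["ti"].xcom_pull(task_ids="decide_days", key="days_to_process")
    print("processing the last %s days of data" % days)

decide = PythonOperator(task_id="decide_days", python_callable=decide_days,
                        provide_context=True, dag=dag)
process_days = PythonOperator(task_id="process_days", python_callable=process,
                              provide_context=True, dag=dag)

wait_for_export >> decide >> process_days
```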
Then there is a lot more in plugins: hooks, connections, web views, template macros and so on can be packaged into a plugin, and anything that should not be part of the Airflow core usually goes into plugins. There is a great airflow-plugins repository on GitHub where you can find some very interesting implementations, for example for Google Analytics; that is something you wouldn't want in the Airflow core, but a lot of people use it and might want to download data from Google Analytics. There are also integrations with payment platforms such as Stripe, and so on.

So here is the sample code that shows how you would create a DAG in Airflow. On the right side we create a new DAG, we give it the name "tutorial", and we create three tasks, t1, t2 and t3, all of which are instances of the BashOperator. We give them IDs and provide the commands we want to run inside each of those BashOperators. The first task just runs the command date, the second one sleeps for five seconds, and the third one uses the Jinja template I mentioned earlier, parameterized with our params. In params we pass a dictionary with a key called my_param, and that my_param gets injected into the template: you can access it from your Jinja template as params followed by the name of the parameter defined in that dictionary. There are also some additional macros available out of the box in Airflow; for example, here we have the macro ds_add, which adds seven days to the execution date. You can also create your own macros and refer to them inside the Jinja template. On the left side we have parameters such as the start date, which defines from which date this Airflow DAG should start running, and we also define the schedule interval, so how frequently the DAG should run; in this case it is one day. A hedged reconstruction of that example is shown below.
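This is essentially the tutorial DAG from the Airflow documentation, so the exact arguments on the slide may have differed slightly, but it is close to what was shown:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "airflow",
    "start_date": datetime(2019, 1, 1),   # from which date the DAG should start running
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG("tutorial", default_args=default_args, schedule_interval=timedelta(days=1))

t1 = BashOperator(task_id="print_date", bash_command="date", dag=dag)

t2 = BashOperator(task_id="sleep", bash_command="sleep 5", dag=dag)

# A Jinja-templated command: {{ ds }} is the execution date, macros.ds_add adds
# 7 days to it, and params.my_param is injected from the params dictionary below.
templated_command = """
{% for i in range(5) %}
    echo "{{ ds }}"
    echo "{{ macros.ds_add(ds, 7) }}"
    echo "{{ params.my_param }}"
{% endfor %}
"""

t3 = BashOperator(
    task_id="templated",
    bash_command=templated_command,
    params={"my_param": "Parameter I passed in"},
    dag=dag,
)

# t2 and t3 only start once t1 has finished; they can then run in parallel.
t1 >> [t2, t3]
```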
As you can see, everything is implemented inside Python, and Airflow is intelligent enough to parse this code, extract all the tasks from it and do the visualization, showing what kind of tasks we have and the dependencies between them. In this case we have tasks t2 and t3 that depend on t1, so t1 is upstream of t2 and t3, which means that only once the t1 task has finished can both t2 and t3 start, and they can run in parallel.

So let's talk about some of the Airflow best practices we use at Skimlinks. A very important one, which has saved our lives a few times, is idempotent DAGs. If you implement your tasks in such a way that they are idempotent, you gain the ability to rerun those tasks in a repeatable way. To do that you just need to make sure that a task always cleans up after itself and doesn't leave any side effects, so you can safely run it twice and get the same result. If you don't do that, you might have a problem: life happens, there will be issues in your production system from time to time, and when you try to rerun a task you may suddenly find side effects you didn't account for. For example, you might end up with twice as much data, or three times more data because you ran the same task three times and it didn't clean up after itself. So this is very, very important, because it allows you to rerun tasks in a repeatable way; a minimal sketch of the pattern is shown below.
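As a minimal sketch of the idea, here is a task that writes its output under a directory keyed by the execution date and deletes whatever is already there first; the output path and the "aggregation" are obviously placeholders:

```python
import os
import shutil
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG("daily_aggregate", start_date=datetime(2019, 1, 1), schedule_interval="@daily")

OUTPUT_ROOT = "/data/daily_aggregate"   # hypothetical output location

def build_daily_aggregate(ds, **context):
    """Idempotent: the task removes its own output for the execution date
    before writing it again, so re-running it never duplicates data."""
    output_dir = os.path.join(OUTPUT_ROOT, ds)
    if os.path.exists(output_dir):
        shutil.rmtree(output_dir)        # clean up anything left by a previous run
    os.makedirs(output_dir)
    with open(os.path.join(output_dir, "part-0.csv"), "w") as f:
        f.write("publisher_id,clicks\n")  # the real aggregation logic would go here

build_aggregate = PythonOperator(
    task_id="build_daily_aggregate",
    python_callable=build_daily_aggregate,
    provide_context=True,
    dag=dag,
)
```

The same idea applies to database targets: delete or truncate the partition for the execution date before inserting, or use an overwrite or merge write mode, so that running the task twice leaves exactly the same data behind.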
Tests: we did struggle a little bit with creating good tests for our DAGs and tasks. The good news is that Airflow gives you a fairly easy way to test your tasks: there is a command, airflow test, where you specify the name of your DAG, the ID of your task and the date, and Airflow will execute that specific task. However, that wasn't enough for us, because we have some complex DAGs with many tasks inside them, and manually running and testing every single task just wasn't cutting it; there are a lot of dependencies between those tasks and you want to make sure you run them all in the correct order. What we ended up doing was creating a test DAG for every single DAG, which runs exactly the same code as the DAG itself, runs it on test input data, creates the output data and compares that output against the expected output. It is pretty much what you would expect from a standard integration test. We execute these integration tests during the CI build, so if anything breaks you will be notified by Jenkins or whichever build system you are using.

It is also a very good idea to use separate environments for production, staging and test or local development. In our case, for the local environment we have a fully dockerized setup; it is very easy to do with Airflow, and once you have it you can run the local dockerized Airflow with a single command. Let's say you have a new developer joining your team: they can get started in three minutes, because all they have to do is clone the Git repo and start the local version of Airflow with docker-compose up. In production we reuse the same Docker images inside a Kubernetes cluster, so we have horizontally scalable workers with the Celery executor.

A very important thing is to make sure you store the logs from your production cluster in persistent storage, such as Amazon S3, HDFS or Google Cloud Storage, because your workers will die from time to time, and if a worker dies with all its logs on it, you are not going to know what caused the problem. Ideally, if you can run a fully managed cluster, that is the best solution, because then you don't have to worry about maintaining the cluster, upgrading Airflow and so on; there are some commercial offerings that let you do that.

When it comes to deployment you have a few strategies. There is the pull method, where you run git-sync on every one of your workers, the UI and the scheduler, and pull the latest code from the master or production branch (or the develop branch if it is a staging cluster). There is a push model, where you push the code to every worker using something like rsync. Alternatively, you can use a persistent volume: all the workers mount the volume read-only and read the latest Python code from it, while you push the latest code to the writable side of that volume. Or you can just bake your code into the Docker containers; the disadvantage there is that whenever you create a new build you have to recreate the entire cluster, so you have to kill all the workers, and you cannot easily push new code if it is baked into the containers. Whichever solution you choose, Airflow gives you very good support for picking up changes in the latest files: there is a setting that defines how frequently Airflow should scan the DAG files for changes. I believe by default it is every few seconds, but you can increase it to every few minutes, and Airflow will automatically pick up the latest changes and run the latest version of your DAGs.
So, to summarize some of the good, the bad and the ugly that we found in Airflow during the last year. There are some issues with Airflow as well; it is a relatively new open source project, so that is to be expected. One problem we have seen is displaying dynamically generated DAGs. Because you create a DAG in Python code, you can create the tasks dynamically: you could, for example, create ten tasks today but five tasks tomorrow. If your number of tasks changes dynamically, that can be a bit tricky for the user interface to visualize, so keep that in mind.

DAG dependencies: here you have a choice to either put all your logic inside a few big and complex DAGs, or split it into smaller DAGs. Usually smaller DAGs are better, but then you have to somehow connect them within Airflow: you need to let Airflow know that once this DAG finishes, the next one can start. You do that with sensors, and the problem with sensors is that they consume resources, because a sensor is essentially a process that keeps waiting and checking some condition; it can check every few seconds whether a file already exists in HDFS or not, so it is an extra resource.

One problem we are seeing in our current version of Airflow is that the scheduler just stops scheduling things after a few days. There is a setting in Airflow that allows you to restart the scheduler, and we restart it every few hours. Apparently this is no longer the case in the latest version, so if you use the latest 1.10.3 it should not happen anymore. We also see some zombie problems from time to time: a process just dies, there is no heartbeat coming out of it, and Airflow marks the process as a zombie, even though the process actually continues and works just fine. Fortunately it is not a big issue, because of the retries we talked about earlier: Airflow will automatically retry that task, and if you implemented it in an idempotent way, meaning there are no side effects from rerunning it, it will work fine.

The last thing is sub-DAGs. Some developers embrace them and some really try to avoid them; we like to use them quite a lot, because they allow us to encapsulate complex code. If you have a huge DAG with lots of tasks, you can group a lot of those tasks into a separate DAG and include that separate DAG as a single task in your existing workflow; a minimal sketch is shown below.
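Here is a minimal, hedged sketch of the pattern, assuming the SubDagOperator from Airflow 1.10; the names and the DummyOperator placeholders are made up:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.subdag_operator import SubDagOperator

def clicks_subdag(parent_dag_id, child_dag_id, default_args):
    """Builds the child DAG that groups the click-processing tasks together."""
    subdag = DAG(
        dag_id="%s.%s" % (parent_dag_id, child_dag_id),
        default_args=default_args,
        schedule_interval="@daily",
    )
    extract = DummyOperator(task_id="extract_clicks", dag=subdag)
    aggregate = DummyOperator(task_id="aggregate_clicks", dag=subdag)
    extract >> aggregate
    return subdag

default_args = {"owner": "airflow", "start_date": datetime(2019, 1, 1)}

dag = DAG("main_pipeline", default_args=default_args, schedule_interval="@daily")

# In the parent DAG the whole group shows up as a single task that you can drill into.
clicks_pipeline = SubDagOperator(
    task_id="clicks_pipeline",
    subdag=clicks_subdag("main_pipeline", "clicks_pipeline", default_args),
    dag=dag,
)
```

Note that the child DAG's dag_id has to be the parent's dag_id plus the task id, joined with a dot; that is how Airflow ties the two together.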
So, to summarize: we managed to finish the migration from Hadoop to BigQuery, everything works great, and Airflow has been a huge productivity enhancer for us. We believe we are at least twice as productive using BigQuery and Airflow as we were with Hadoop and Oozie. So yes, as you can see, we are very happy at Skimlinks, and I highly recommend it. Thanks very much. I don't know if we have any time for questions, but if anyone has a question, just let me know.

Sorry, go on; about the sub-DAGs? So, a sub-DAG: essentially, remember when we talked about the different types of operators? There is a SubDagOperator where you use another DAG as your task, so it basically becomes just another one of your tasks; you can drill into such a task and see an expansion of all the tasks that are within that other DAG. Well, we were reading a lot that the scheduler can struggle with some of them, because they are scheduled at the same level and executed as a single task. We didn't really see any problems with that ourselves, but I was reading a lot about such issues. Yeah, no problem.

Well, it depends on how you schedule it and how you set it up, but there shouldn't be any single point of failure. I think the user interface might be a single point of failure, but apart from that you can very easily run multiple threads for the scheduler and you have multiple workers, so it shouldn't be the case. Right, sorry for taking extra time. Thank you very much.