Hello and welcome. Thank you for having me. I'd like to thank Alexi and Rosa for encouraging me to do this talk. Thank you guys. They are my bosses and my yearly review is coming up, so you know. All right, so who am I? My name is Sebastien Croquetvieille. I'm French. This is my first time speaking at EuroPython, so thank you for coming to my talk. And I'd like to introduce my dog Kimo. She's very cute. As a data engineer at Numberly, I've worked with Airflow in production for almost six years now, and I work with a standard data engineering stack: PySpark, Python, MongoDB, and a bunch of other databases like ScyllaDB, Hive, etc. I've been attending EuroPython since 2019 in Basel and it's a great event, so thank you so much for attending, and for hosting, for those of you who volunteer. And I flew from Taiwan to be here, so yeah, very excited about this event.

Today I'll be talking to you about Apache Airflow and a little bit of what Calvin discussed earlier. I'll also be talking about how to run it on your local machine so you can test stuff. As this is an intermediate talk, we'll be covering some but not all of the basics, so I might skip over some stuff. No big deal. I hope to convey some practical knowledge so that you can use Airflow on your own projects. Before we begin with the actual presentation, here's some stuff that you shouldn't expect from this talk. First of all, there will be no comparisons. I know that some of you may be Dagster fans. I understand that some of you are a little bit extreme. We can talk about it after the talk, no problem, but I won't be covering it because I have not tried any of those other solutions. I only have 30 minutes, so I'm going to avoid some complex stuff and just stick to some simple use cases, but I'm open to any questions. And this talk is also about memes. I like memes a lot, so I'll be showing some during the talk.

Let's dive in. I'm going to introduce Apache Airflow a little bit and then we're going to get to playing with it. First of all, what is Apache Airflow? It's an orchestrator. An orchestrator is something that runs code in a certain way, at a certain time, and does so repeatedly. You can also call them schedulers; same thing. Apache Airflow is open source. That means you don't pay money, but you pay with your own time and effort to learn. As a result, you get something that works pretty well and your skills increase, and that's something we care about at Numberly.

Another thing that's very important about Airflow, and that was showcased a little bit in the previous talk, is the kick-ass UI. Having a good user interface allows you to empower non-tech people, which is very important for us. I work with a lot of PMs. They have some controls, and they get some knowledge through the interface. One of the key points is when a client calls and they're not happy because a file isn't there or something's going wrong, and they start pounding on the PM by phone or whatever. The PM can go check the UI and tell them: everything is fine on our side, it's your fault. They get independence and a little bit of resilience from that, so it's very important to them.

Airflow is mostly in Python. It has a very active user and contributor base. There are like 15 different Airflow tags on Stack Overflow with daily questions, and there's a PR every three to four hours.
I checked this morning; it was three hours. So it's very active, it's something that keeps developing, and as Calvin said, there are new features that keep coming out and they're very useful. Finally, it's number 27 on the OSS Rank, which says a little bit about its quality.

So how would you go about installing Apache Airflow, and what exactly are you installing when you do that? It's really, really difficult: all you have to do is install Apache Airflow and you're basically done. There are a few extra steps, but you can go get some coffee while the scripts run. It's very much a batteries-included type of architecture. Kind of like what Alexi mentioned in his talk, it's like Python: you take it, it just works, and it gives you a bunch of features beyond what you asked for. So we'll talk a little bit about the architecture and the different parts, and then we'll do some code later.

First of all, there's the UI, the graphical user interface. It gives you a "trust what you can see" approach, which is very nice. We'll get into it a little bit later. Then there's the web server: it scans the metadata database, scans the DAGs, lets you see everything that you have, and gives you some control over a few simple objects like connections. Then there's the scheduler. It scans more or less the same things, but it plays a very different role. It's the heartbeat of your orchestrator. It finds what needs to be run and what is on your machine, and gives you parsing, discovery, and scheduling. The database mainly keeps your history and some references to the different objects that you have in Airflow. The DAG directory, if you just install it locally, is just a fucking folder. If you use something a little bit more advanced, you might need something else. And workers are just the elements that run your code; locally they could just be Python processes.

So let's start with the database. Once you install Apache Airflow, you need to initialize your database, so: airflow db init. You only do that once, on your first-time setup. Locally it uses SQLite. Don't use that in production, because you can't have any concurrency with it, so it will kind of break your production setup if you do. And it uses SQLAlchemy under the hood. If you plan to use the UI, you also have to create an Airflow user.

Similarly for the web server: if you want to pop a web server, there's no front end required, so you don't need to know any JavaScript, but you can still have a UI, which is nice. And it just works. It's tailored for observability, and you can use the CLI to launch it from anywhere. So once you do that, you get the user interface, which is quite nice. It gives you a rundown of all the DAGs that are on your service. You can see the owners, the run history, the schedules, the last run, the next run. And all the way on the right, the big white thing is the last execution: the states of all the tasks of your last execution. So like I said, it gives you some observability, and your PMs have an instant idea of whether something went wrong: if there's a little red box, something failed, and if there's none, everything is fine. You also have a grid view, which gives you a little bit more detail: how long things took to run, what tasks ran, what tasks didn't, what time things ran, et cetera, et cetera. So it gives you the ability to see statuses and monitor performance over time.

Similarly for the scheduler: once you've installed Airflow, just airflow scheduler and boom, you have a scheduler running on your local machine.
If you don't configure anything, it will run with the sequential executor, which, like we said earlier, doesn't do any concurrency. But if you run it in production, you'll generally use the Celery executor or the Kubernetes executor, which have different properties that allow you to scale much better.

Next is the DAG directory. Like I said, it's just a folder. If you don't load any of the tutorial DAGs, it won't exist when you first set things up, so you have to create it. If you use Kubernetes like us, you might have to change a couple of things. You obviously won't be able to use a local folder, so you'll either have to pull all of your DAGs from something like GitHub, or you might have to mount a volume to avoid losing your DAGs if your pods ever die.

All right. So now that the introduction and all this information is out of the way, let's focus on some core Airflow objects and play around with them a little bit. Basically, doing Airflow is just coding Python DAGs. That's all you're going to do. Unless you want to configure things a little, it's fairly simple. DAGs are basically jobs. They run again and again with a few modifications from the context, like different dates or different elements that you feed them from a config file, but essentially it's just a bunch of boilerplate code with a few specific things that you feed it, and then it just works. The object you see at the bottom there is the visual representation of a DAG. Each of the little boxes, or nodes, is a task. Ideally, you'd want those to be atomic and idempotent operations, because that'll help you get the most out of Airflow. But honestly, you can just do whatever you want, and it will run anyway. Finally, operators are what we'll be focusing on today in this talk. Those are the core Python elements that actually run your code.

And since it's all Python, it's very easy for us to just go under the hood. There's very nice documentation, but if that's not enough, you can just go into the source code. It's all Python code, and it's very easy to look at. And actually Numberly has contributed a couple of times to Airflow OSS and requested features, because since you can understand it, you can ask for stuff. So that's quite nice.

So we'll be looking at two objects today. The first of the two is hooks. Hooks are a high-level interface. You can basically picture them like you would any physical hook: it just goes out, gets some sort of connection, and brings it back to you so that you can use it. Generally, it hides a bunch of stuff: either complex operations, API calls, or an external library that you're using that you don't want to import in your code. You just import it in the hook, and that kind of puts it away, and you don't have to deal with it ever again. Interesting methods to implement for yourself are things like getting a connection, closing a connection, doing file operations, et cetera.

Operators, like I've mentioned before, are the objects that actually run your code. What's important about operators is that they have to, first of all, try to keep the properties that we discussed earlier, or at least maintain them if you have them in your code. They're often built using hooks, so you use one Airflow object to build another Airflow object. And these are not objects that you will call yourself: Airflow will call them directly when you run your DAG, when you schedule things. It's important to have that in mind, because it will influence the way you make a custom operator.
What's great about these is that since you don't call them yourself, you can basically generate 10,000 DAGs if you want, have them all call a bunch of operators with a bunch of different settings, and you can basically reuse your logic with very little code. So you could say that since they give you such a streamlined experience, they're smooth operators.

Of course, you don't get all this for free. You have to guarantee a few basic principles to be able to get Airflow to work with you. Generally, what you do when you create these objects is inherit from the base classes: there's the base operator, which your operators inherit from, and the base hook, which your hooks inherit from. You can have intermediate steps if you want. In Airflow, for example, most of the relational databases inherit from the SQL operator, which itself inherits from the base operator, but essentially, at some point, they have to inherit the traits of the base operator.

So here's the use case we'll be having today. In Python we're not so much a compiling community, but we still have data pipelines and a bunch of stuff that takes a long time to run. So here we'll be imagining that I have some sort of workflow that takes forever to run, and I just want to go have a break, sit on the toilet, and look at memes. So I'll be using the PRAW Python library to send myself a Reddit message when my pipeline is over, to tell me to get off the toilet and get back to work.

All right, so first we start with a hook. Like we said, most of the stuff in Airflow is done with hooks, so we just make a hook. We inherit from the base hook and we give it a constructor. And in the base hook, there's a get_conn method that's declared but not implemented, and that's what you basically override. You override that and you put whatever you want in it. In a sense, hooks are very, very free compared to operators. You can actually put whatever methods you want and then use them however you want in your Python code, but you'll have to deal with them later, since they won't be called directly by Airflow.

So here we'll take a little bit of a look at the code. At the top you have imports; you don't have to worry too much about those. Then we define a hook at the very top there. It's basically a Python class: defining a hook is defining a Python class. Inside this class, we create a constructor. In the constructor, I check the connection a little bit. It gets an argument that is a conn ID, so a connection ID. I just make sure that it's not null or an empty object, and if it is, I raise an Airflow exception. This is nice because it allows me to open a little parenthesis about Airflow's built-in exception handling. So Airflow has built-in exceptions. It has failure exceptions, it has retry exceptions, it has skipping exceptions. That's nice because it doesn't just let you do normal messaging like exceptions tend to do in other frameworks, it also lets you directly affect the way your code is run. If, for example, you raise a failure exception, you can fail a task directly without getting it to retry. If you raise a skip exception, you just skip a task, which means it didn't run, but that's OK. So you get a little bit more control from Airflow exceptions than you would from traditional exceptions.
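To make that a bit more concrete, here's a minimal sketch of what such a hook might look like. This is not the exact code from the slides: it assumes Airflow 2.x and the praw library, and the connection field names (a client ID and secret stored in the connection extras, plus the account login and password) are illustrative.

```python
import praw

from airflow.exceptions import AirflowException
from airflow.hooks.base import BaseHook


class RedditHook(BaseHook):
    """Hides the praw import and hands back a ready-to-use Reddit client."""

    def __init__(self, reddit_conn_id):
        super().__init__()
        # Fail early, with an Airflow exception, if no connection ID was given.
        if not reddit_conn_id:
            raise AirflowException("A Reddit connection ID must be provided")
        self.reddit_conn_id = reddit_conn_id

    def get_conn(self):
        # Pull the credentials stored in the Airflow connection...
        conn = self.get_connection(self.reddit_conn_id)
        extra = conn.extra_dejson
        # ...and feed them to praw, which returns a working Reddit client.
        return praw.Reddit(
            client_id=extra.get("client_id"),
            client_secret=extra.get("client_secret"),
            username=conn.login,
            password=conn.password,
            user_agent=extra.get("user_agent", "airflow-reddit-hook"),
        )
```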
And finally, I put the bulk of the work inside the get_conn method, which basically gets connection information from the Airflow connections and then just feeds it to a praw Reddit object, which returns a working connection to Reddit. So now I have a hook that allows me to go and get a connection to Reddit from this library that I've imported.

And the next step is making an operator. So similarly, I inherit from the base operator and I create my constructor if there are things that change. And here, what's very, very important is that you have to override the execute method. In a hook, you can do more or less whatever you want; that's your problem, because you'll be using the hook later. But Airflow is the one that calls your operators, and it calls them through the execute method. So that's your entry point. If you want to define other stuff, that's OK, but anything that you don't use in the constructor or in the execute method will not be called, unless you do some dark voodoo magic.

So similarly, once again, making an operator is just making a class in Python that inherits from the base operator. So that's what we do. Then we define a constructor with all the extra parameters that I need to be able to send my Reddit message. So I'm going to need the connection to the Reddit account, I'm going to need the user that I'm targeting, which is going to be myself, and the message that I want to send. That's basically what I'm putting in the constructor. And then at the very end, I define my execute method, which receives a context. The context is not fed by me; it's fed directly by Airflow. It has a bunch of information in it that I'll use a little bit later. All this code will be available to you on GitHub. I'll show you the link at the end of the slides, and the slides will also be made available.

So at the end, we have the execute method. There, I just call my Reddit hook on the first line, give it the connection it needs, and get my connection to Reddit, which is nice. Then on the next line, I look up my Reddit user, which is myself, from the parameters that I fed the constructor. And on the last line, I send the Reddit user a message in which I feed the DAG ID that I got from the context, so I know which pipeline is calling me. If it's one that doesn't matter, I just stay on the toilet. And that's pretty much it. It's as simple as it looks.

So now we've defined custom operators that do custom stuff, and this could be literally anything, except maybe sending you Twitter messages, since that's no longer available. But yeah, since this needs a DAG to work with, let's just put it in one. And DAGs, similarly to all the Python objects we've discussed, are also just a Python file. So you put your imports at the top. We're going to need to import whatever operators we're using, DAG, and a bunch of timing functions. Then I define the base configuration. I'm not going to go into too much detail about this, but there are basically big categories of stuff in the configuration. You have some stuff for identification, some stuff to tell you what the parallelism of your DAG is going to be: can you run multiple tasks in parallel, can you run multiple DAG runs in parallel. You have some error handling: do I want to send emails, do I want to send emails when it fails, when it retries, that kind of stuff, and the address that you want to send your alerting emails to. So yeah.
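Again as a rough sketch rather than the exact code from the slides, the operator and the DAG that wires it up might look something like this. The import paths for the hook and operator, the task IDs, the connection ID, and the schedule are all hypothetical.

```python
from airflow.models.baseoperator import BaseOperator

from reddit_hook import RedditHook  # the hook sketched above; module path is hypothetical


class RedditMessageOperator(BaseOperator):
    def __init__(self, reddit_conn_id, target_user, message, **kwargs):
        super().__init__(**kwargs)
        self.reddit_conn_id = reddit_conn_id
        self.target_user = target_user
        self.message = message

    def execute(self, context):
        # The hook goes and gets a working Reddit connection for us.
        reddit = RedditHook(self.reddit_conn_id).get_conn()
        # Look up the target user and send the message, tagging it with the
        # DAG ID that Airflow passes in through the execution context.
        reddit.redditor(self.target_user).message(
            subject=f"DAG {context['dag'].dag_id} is done",
            message=self.message,
        )
```

And the DAG file tying it together, with the Working Hard and Hardly Working tasks and the bit-shift ordering, could look roughly like this:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

from reddit_operator import RedditMessageOperator  # hypothetical module path

# Boilerplate default configuration: identification, retries, alerting settings.
default_args = {
    "owner": "data-engineering",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
    "email_on_failure": False,
}

with DAG(
    dag_id="toilet_break_notifier",  # hypothetical DAG ID
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    # A stand-in for the long-running pipeline work.
    working_hard = BashOperator(task_id="working_hard", bash_command="sleep 60")

    hardly_working = RedditMessageOperator(
        task_id="hardly_working",
        reddit_conn_id="reddit_default",
        target_user="my_reddit_username",  # hypothetical
        message="Pipeline done, get off the toilet.",
    )

    # Airflow's bit-shift syntax sets the task order.
    working_hard >> hardly_working
```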
Then once you've defined your initial configuration and your imports, we're going to actually create the DAG object. So you basically use the DAG context manager and feed it all the stuff that you've defined previously. All that stuff is fairly boilerplate. I put it here to show it to you, but essentially, if you always use the same configuration, you can just mess with the default values and you don't have to write most of it.

So now we get to the interesting part. I'm going to create two tasks. I'm going to create a task called Working Hard that pretends to do some work, and then I'm going to create a task called Hardly Working that sends me my message via the Reddit message operator. And then it's as simple as calling the operator I created, giving it all the arguments it needs, and using the bit-shift operator at the end there to put my tasks in the order I want them to run in, because Airflow has a specific syntax for that. So once that's done, it works on my local machine. And yeah, I blacked out the name to avoid reverse-doxing myself, because apparently you can't send messages with new Reddit accounts.

All right, so once we've done that, we see that it works, and we can go back to the UI and check it again for all the little benefits it brings. The UI definitely gives you a lot of perspectives on DAG execution. Here you have the graph view: you just see the tasks, what order they execute in, and what their state is. You have the Gantt view to see when Airflow is doing something, when Airflow is not doing anything on this particular DAG, and how long my execution takes. And you can use the task duration view to see if you have bottlenecks or things that you can solve through parallelism, that sort of thing.

So I'm going to touch on a few more subjects, and then we're going to talk about Airflow in production at Numberly. There's some cool stuff that you can do when creating your own operators. For example, templating is super easy to set up in your operators. For any of the fields that you have defined and that your operator uses, you just list them in template_fields, and voilà, you have automatic Jinja templating in one line of code. And then you can use any of Airflow's macros inside your Jinja templates, or you can actually use Python functions inside them, like datetime functions, et cetera.

One of the pitfalls that you have to avoid when you're doing this kind of thing, when you're installing Airflow for the first time or when you're using it in production, is not worrying about your Python path. All your imports must be done from a path that's in the Python path. Here, for example, I directly put my GitHub repository in the DAGs folder. So if I want to get my object, I have to go into the DAGs folder, then get my file: from repoName.myModule, I have to import my object. If you were to do this in production, I'd recommend having either an installed library or a specific folder that you define and add to your Python path with all your custom operators. But as long as it's on your machine, you can do this kind of thing.
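(Coming back to the templating point for a second: on a pared-down version of the operator from before, and not the actual code from the slides, that one line could look like this.)

```python
from airflow.models.baseoperator import BaseOperator


class RedditMessageOperator(BaseOperator):
    # The one line in question: every field listed here gets rendered through
    # Jinja before execute() runs, so macros work inside the "message" argument.
    template_fields = ("message",)

    def __init__(self, message, **kwargs):
        super().__init__(**kwargs)
        self.message = message
```

With that in place, passing something like message="Pipeline finished on {{ ds }}" would have Airflow fill in the execution date at runtime.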
All right. Now let me give you a few numbers and some information about how we run Airflow in production at Numberly. We're a pretty big company, but we don't have 10,000 DAGs: it's about 600 active DAGs in a single Airflow instance. We use Kubernetes for execution. The running consumption is around a quarter of a CPU and 1.5 GB of memory, but these are pretty beefy CPUs. And DAG parsing times vary from 0.5 microseconds to 0.5 seconds. The 0.5 seconds is for the more dynamic DAGs with lots of generated tasks, which is very consistent with the previous talk. And we've been doing this for about seven years.

So here are some of the features that we use with Airflow in production. We love the Airflow API. It's very useful: you get a lot of information from it, you can trigger DAGs dynamically from it, and you can make it interact with a lot of stuff. It comes with its own documentation directly in the web UI, so you don't really have to deal with that, and the security settings are already pretty much set up. So that's very nice. We like to use sensors a lot, too. They are operators that basically check whether something has happened or not: if it has, they succeed and the DAG moves on; if it hasn't, they eventually fail. Python virtualenv operators, which allow you to run your code in a virtualenv so that you can use libraries you don't want to install in production. And obviously, our custom Airflow objects. An important point about this is that we use a homemade Airflow helpers library, which allows as many people as possible across the data teams to collaborate on creating these objects. It's really simple, just like you saw: you can just get a PR from anyone, review it, and bam, you have a new tool that anyone can use. And it also helps raise everyone's knowledge of Airflow, which, as we've said, is full Python, so it's fairly accessible.

Here's some of the stuff we don't use yet. I'm not going to go into the details, but the list is here for reference. If you want to talk about it, you can come see me after the talk. Thank you so much. Can I have some questions?

And I accidentally ended up welcoming all of the Numberly speakers today, and on the other days as well. Thank you for your talk. We do not have any questions on the Discord channel, but if you would like to ask something, we have time for two or three questions. You can go to the microphones to ask a question. I guess not... ah, one, please.

Thanks for the nice talk. I am wondering, how do you test your DAGs in your CI/CD pipeline? Do you have some automated tests for that?

Yeah, we have a couple of things. First of all, I don't know if you saw the previous talk, but one of the more famous things is the broken DAG error: when you push something to production and there's an error in the parsing of the DAG. So first of all, we have a validation step that runs in an image that's similar to what we have in production, with all the installed libraries, and just runs Python on the DAG file. That allows you to check if there are any libraries that aren't there, or any problems with parsing the DAG. You could probably also implement a time check on how long it takes to parse the DAG; that's not something we have done yet. But those are all things that you want to validate before you run your DAG. The next thing is that we have a code review process on the tasks we implement. So if you use, for example, a Python operator, you can write your own custom code and put it directly in the DAG, so we review that. But yeah, that's essentially it. I mean, you have the normal linting stuff: black, flake8, bandit, that kind of thing. Thank you.
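(For reference, a minimal sketch of the kind of parse check described in that answer, assuming pytest and a local dags/ folder rather than Numberly's actual CI image, might look like this.)

```python
from airflow.models import DagBag


def test_dags_parse_without_errors():
    # Loading the DagBag parses every DAG file, much like the scheduler would.
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    # Missing libraries or syntax errors show up as import errors.
    assert not dag_bag.import_errors, f"Broken DAGs: {dag_bag.import_errors}"
    # And the folder should actually produce at least one DAG.
    assert len(dag_bag.dags) > 0
```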
It's been a delight listening to you all. And if you would like to find the slides or reach the speaker himself, you can do so through the EuroPython 2023 Discord channel. And thank you so much. Thank you. Thank you.