Awesome. Thank you all for coming to hear about Airflow and DAG factories on this beautiful morning here in Prague. I'm excited to be here.

A little bit of origin and backstory on this talk. There was another talk I gave about a year and a half ago on data pipeline modernization at scale, and that was basically where a lot of this work started. So if you want the backstory, there's a QR code here that takes you to the YouTube video. I'll also post the slides on my Twitter right afterward — actually, I'll go back real quick for anybody who didn't want to miss that. And if you just search for my name and "data pipelines" on YouTube, you'll find it; that's a benefit of having a long, crazy name that no one else in the world has. The slides for this, and all of the code I'll show, are in this GitHub repository. In addition to the slides, it has a full Docker Compose setup, so you can try everything I'm about to show during this talk. Be warned: it will spin up about 11 containers, and if you go for the big example I'll show, you'll need a lot of memory and a lot of CPU. Thanks to yesterday's morning talk, we now know we have more CPU available to us than we ever could have imagined, but some laptops may not have it yet. You'll need it, and your fans will definitely start spinning up.

Why should you care about being here? I put two reasons here, two alternate reasons: one, you think Airflow may be the wrong tool, so you're here to find out why it might be the right one; and two, you think Airflow can't scale to meet your needs. As someone else once put it, there are two kinds of people in the world: people who love Airflow, and people who probably aren't using it correctly. I think there's a lot of misunderstanding about what goes on behind the scenes inside of Airflow, because it's a complicated beast — there's a lot going on.

Before we get started, I want to make sure we set our intention. This is one of my favorite Easter eggs in Python itself: the Zen of Python, by Tim Peters. A lot of good can be drawn from the words on the screen, especially "beautiful is better than ugly" and "explicit is better than implicit." We're going to start off with some examples of using a DAG factory, which can be elegant and look very beautiful and very simple for what we think the task might be. But as we'll find out, there may be issues with that.

I won't spend a ton of time on the intro to Airflow, because I assume a lot of folks in the room know what it generally is. It's an orchestration platform; a lot of folks in the data engineering space use it to orchestrate their ETL pipelines. It's made up of DAGs — directed acyclic graphs — which describe the tasks you're going to run as you go about your daily work inside of Airflow. It's made up of operators, which you can think of as pre-packaged types of work: things as simple as a Bash operator or a Python operator, or a Databricks operator, or other operators that connect out to other places where work can be done. There are sensors and connections: sensors give you the ability to watch for things happening inside the Airflow environment.
Connections are your pools of connections to databases or other data sources or APIs, things like that. Hooks, again, give you easy access to things like external APIs and external systems. And then there's the scheduler, which is actually a lot of what this talk is about: getting things going in the scheduler.

Who in the room has run more than 100 DAGs in their Airflow? And who's seen the error where it failed because your DAGs took too long to load — you couldn't even get them to show up in the scheduler? Yeah, so you've all experienced the pain I'm getting ready to demo, which is kind of cool.

Generally, it works out like this. There's a user interface: when you go to the Airflow UI, you're seeing the front-end web server, but really that's just a facade over the top of the scheduler and all the workers. As you put your DAGs in the DAGs directory, they get read in pretty much any time another DAG runs, or on a periodic interval. The DAGs get read a tremendous number of times in the Airflow ecosystem, from the scheduler to the workers, because they basically get serialized and pushed off to any place Airflow may need them. And since you can distribute Airflow across many, many containers — you can run it in a Kubernetes cluster across a lot of workers — that's a lot of places those DAGs will get read. Then there's the metadata database, which tracks what you've said you're going to do, all the runs that have happened, and all the logging going on inside of this.

But let's get into what the real problem is. If you've played with Airflow, you know the simplest case: if you're just doing the tutorial, you can throw a Python module into the DAGs folder, and if it contains a DAG object, Airflow will pick it up and produce a DAG in the UI whose tasks you can now run. But you may have a lot of tasks you want to do. For example, I mentioned the origin story of this talk: doing data pipelines at scale. That specific pipeline consists of over 10,000 tables that we're moving on a daily basis. That's 10,000 DAGs if we do a DAG per table, which is a pretty reasonable approach here. But hard-coding 10,000 DAGs and maintaining that level of code in your repository is a bit unrealistic. So as you get to this kind of scale, you can do things with dynamic DAGs. Let me show you what that looks like. Bingo.

All right. We've got an aptly named Python module called worst_case_factory. What it does for this demo — and it's actually maybe an interesting approach some folks may want to look at — is work from a set of configuration files, one per table or per data source, because some of the data may not be tables; it may be APIs, or files sitting on a file server someplace. We generate a set of configurations — this is the repository that's available on GitHub — and in this case I generated 100 configurations: 50 incremental and 50 full loads of some theoretical data source out there. If you look at these JSON files, they're just a description of where the source may be, where the destination may be, and who to call if things fail. So you could use this pattern for easy onboarding of new data sources: you just throw a JSON file into a directory structure, your CI pipeline runs, and away you go.
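To make that concrete, here's a minimal sketch of what a factory module like this might look like — assuming Airflow 2.x-style imports, and with hypothetical config fields (source, destination, load_type) standing in for the repository's real ones:

```python
# dags/worst_case_factory.py — one module that fans out into many DAGs.
import json
from datetime import datetime
from pathlib import Path

from airflow import DAG
from airflow.operators.bash import BashOperator

# Assumed layout: JSON configs sitting next to this module.
CONFIG_DIR = Path(__file__).parent / "configs"

for config_file in sorted(CONFIG_DIR.glob("*.json")):
    config = json.loads(config_file.read_text())

    with DAG(
        dag_id=f"load_{config_file.stem}",
        start_date=datetime(2023, 1, 1),
        schedule="@daily",
        catchup=False,
        tags=[config.get("load_type", "incremental")],
    ) as dag:
        BashOperator(
            task_id="extract_and_load",
            bash_command=f"echo moving {config['source']} to {config['destination']}",
        )

    # The DAG file processor only discovers DAG objects reachable at module
    # top level, so each DAG must be bound to a unique global name.
    globals()[dag.dag_id] = dag
```

The globals() assignment at the end is the whole trick: Airflow imports the module and collects every DAG object it finds at module level, so one file can yield as many DAGs as there are configs.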
Airflow then reads in — if you're using this dynamic case — every JSON config file in that path and dynamically generates a... oh, here it is, I missed it: a DAG. Based on the name of each file, we create a new DAG in the metadata database that's now tracked as a job that can be handled. And this works great, up to a point. Let me demo that right now.

So I've got 100 — well, 101, because the tutorial DAG is in here. These are 100 DAGs generated with that last technique, and this should be running. Yay, okay. Hallelujah for the demo gods. Now we're going to make a change. For those of you following along in the code base, there's a Python module at the root of the repository called build_configs, and at the very bottom of it you can set how many configs you want to generate — we were just generating demo ones for the sake of playing around. Going from 50, which generated 50 incremental and 50 full, we're now going to 500 incremental and 500 full DAGs in this instance. And we'll re-run make worst-case. There we go — all the DAG files are written, and now it's spinning up all the containers that run Airflow, plus the scheduler, the workers, et cetera.

If you've seen this before — I gave this talk at PyCon earlier this year — I've added some things to it, and to the infrastructure. We upgraded Airflow to the latest version, 2.6, which gives us some interesting new metrics we can watch, so this new demo includes Prometheus and Grafana. I'll show some of that later, but it's really interesting now: we can see where the performance bottlenecks are and start playing with them. Because one of the things that's nice about Airflow — or hard about Airflow — is that it has so many tunable options, and it can be difficult to know which one is really affecting your particular problem. Hopefully with this we can see our problem a little more easily.

All right, Airflow is back up. Aha — it may be small to see, but at the top here in red is the "broken DAG" error. Now, "broken DAG" is probably a misnomer — a terrible error message in this case. Nothing about the DAG is actually broken; it's the exact same DAG, generated from the exact same code, for the 500 example as for the 50 example. The issue is that the DAG factory took longer than 30 seconds to scan through the configuration files and start building the DAGs dynamically in the metadata database. Now, for those of you familiar with Airflow, you may know that inside the configuration — if we go in here to configurations — there's a timeout for the DAG bag import. It's at 30 seconds. It's a magic number, right? I should just change it to whatever I want and keep going on my merry way. There's a problem there, though: you're going to need a lot more compute on the scheduler as you start increasing that number. So I don't recommend raising it; instead, approach the problem a little differently and think about how to actually solve it. Let me get back to the slides.

So: Airflow will slow down with lots of DAGs, to the point where it errors out and doesn't work for us at all. At this point, our Airflow is in an unusable state.
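For reference, that timeout lives in the [core] section of airflow.cfg (or the matching environment variable) — shown here at its default, as a sketch, since raising it mostly just shifts the load onto the scheduler:

```ini
# airflow.cfg — the timeout behind the "broken DAG" error
[core]
dagbag_import_timeout = 30.0

# Equivalent environment variable:
#   AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT=30.0
```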
We've gotten to the root cause of the problem: that 30-second timeout. But let's talk about what actually solves it. We started with dynamic DAGs, which I actually think is a great approach because it's really clean: I have one Python module that defines my templated DAG factory — the DAG I'll use for all of my various sources and destinations. One place to change the code, no forgetting to update things as I would with 100 Python modules in my DAGs directory; I've got one. Now we want to scale this up so the scheduler can actually handle what we want to do. So let's look at what that takes.

We're going to change our approach. If you look right here in the DAGs folder, this is that one module, the worst-case factory — the thing that loops over the configuration files and figures out what to do. Instead of having the DAG factory produce a set of DAGs inside Airflow at import time, we want a preprocessor that generates hard-coded DAG files into that DAGs folder — generated code that we don't maintain by hand. Same approach as before — we maintain one file, one DAG definition — but now able to scale to 10,000 DAGs easily inside one Airflow instance.

Okay. We'll leave the number of DAGs at 500 incremental and 500 full, but switch how we do the work, and I'll run the example first. Luckily there's a make better-case target, because I'd never remember all these Docker commands I'm running. Let me show you what happened behind the scenes. The DAGs folder no longer has just one module in it; it has a hashed directory structure of generated code. For every config that was generated — you can see we've got the generated configurations — there's a matching Python file copied in, and that copy is just a copy of this one module, gen_dag_from_filename. Instead of a loop over each configuration file, this module contains a single DAG, and that DAG grabs its name from a little bit of work at the top, where we take the module's own file name and use it as the ID for our DAG. The code across all the DAGs is identical; each is just a copy of the file. We originally tried symlinks for this — I don't remember why that didn't work — but in the end, since we're typically going to run this in a CI pipeline to produce a container that we deploy into a Kubernetes cluster for running our Airflow, we just do it in CI. The build process grabs the latest Airflow image, layers our one DAG template file on top of it, and imports all of the config files for all of our data sources. From that, every one of these files is identical to the file I was just showing you, gen_dag_from_filename — a straight copy. Inside each one, we grab the file name to produce the name of that DAG, and the real cleverness — or trickery — is that when we produce the DAG, it uses that file name as its unique identifier in the metadata database.
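Roughly, each copied file looks like this — a minimal sketch assuming Airflow 2.x-style imports, with the matching JSON config written alongside at build time (field names here are hypothetical):

```python
# dags/generated/ab/cd/incr_0042.py — every generated file is a
# byte-for-byte copy of this template; only the file name differs.
import json
from datetime import datetime
from pathlib import Path

from airflow import DAG
from airflow.operators.bash import BashOperator

# The one thing unique to each copy: its own file name becomes the dag_id.
DAG_ID = Path(__file__).stem

# The matching config was written next to the module at build time.
# (Hard-coding the values directly into the file is faster still.)
config = json.loads(Path(__file__).with_suffix(".json").read_text())

with DAG(
    dag_id=DAG_ID,
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    BashOperator(
        task_id="extract_and_load",
        bash_command=f"echo moving {config['source']} to {config['destination']}",
    )
```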
And if we go back to Airflow — bingo. You'll see right now we've got 845 DAGs in the system, and that's still growing, because what happens is multi-phase: first Airflow reads through all the DAG files in that directory and imports them all, and if that all happens in under 30 seconds, it starts inserting them into the database so they can show up in the UI. At this point we should be getting close to a thousand DAGs in there. Give or take — with the random generation of those names there's occasionally a collision, so you just get a duplicate — but generally we'll get about a thousand DAGs in this demo. Which is awesome.

So now that it's all working, we can take this to the next step, which I'll start now because it'll take a hot minute while I show you the rest. Let's change this up to 5,000, so we get to our 10,000 DAGs. It may or may not finish by the time we're done with this talk, but luckily it all runs asynchronously and will keep building in the background, and I can show you that we're generating 10,000 DAG files right here inside these directories. That will take a minute; we'll let it run. And that's exactly what we did here: we pre-computed the configurations and all the data needed during the build.

One last thing I want to show before we move along. Back when we were doing the first part — we built the... nope, not build_dags... oh, I missed it, the worst case. One thing we also did in here to exacerbate the problem — and something you should always keep in mind when building Airflow DAGs — is that you want to minimize the time spent in the DAG code itself. You'll see right here we're introducing a sleep to show off the problem, because of something we learned along the way on the first project. We were trying to be smart and cloud native: we were storing parameters and key-value pairs in cloud services so we could bring them in dynamically. If we had a new data source, infrastructure-as-code could deploy new key-value pairs for that data source — passwords, anything needed to access the sources or destinations. This was on Azure, so we were using App Configuration for that, and we were making App Config calls inside the DAG to populate data into the DAG so we could access the source and the destinations.

This is a terrible idea, in the end. You want to hard-code as much as possible into your DAG file. So if you have the opportunity, and you're doing what I'm doing here — generating those DAGs — use that to hard-code values in. The code will look terrible; it's not what you'd want to write as a Zen of Python master. But it gets rid of these small delays, and these small delays add up tremendously when you go to import a dynamic DAG. You can see we've got a sleep of about a tenth of a second here. With this delay, I get about 50 configs before my machine starts to fall over. Without the delay it still maxes out eventually — even with that artificial time spent doing something removed. Calling an API service, pulling in data — we even found there were a couple of libraries that were slow; there was a JSON library we'd chosen that was slow. So do a little profiling to understand whether you have slowdowns in the code, because that will greatly affect your ability to scale the scheduler.
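Here's a minimal before-and-after sketch of that idea — the App Config call is simulated with a sleep, just like in the demo, and all the names and values are illustrative:

```python
# A DAG file's top level runs on every parse — in the scheduler's DAG
# processor and anywhere else the file is loaded — so keep it cheap.
import time
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator


def fetch_config_from_app_config(table: str) -> dict:
    # BAD: stand-in for a remote call (Azure App Configuration, etc.).
    # 0.1 s per file, times thousands of files, blows straight through
    # the 30-second import budget.
    time.sleep(0.1)
    return {"host": "db.example.internal", "table": table}


# config = fetch_config_from_app_config("customers")  # don't do this

# GOOD: let the build step bake the values straight into the file.
config = {"host": "db.example.internal", "table": "customers"}

with DAG(
    dag_id="load_customers",
    start_date=datetime(2023, 1, 1),
    schedule=None,
) as dag:
    BashOperator(
        task_id="load",
        bash_command=f"echo loading {config['table']} from {config['host']}",
    )
```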
So even without the delay, I'm still only getting to about 1,200 configs, or 2,400 DAGs — and I've already shown 1,000 importing really quickly with the non-factory method. The DAG factory just has a lot of overhead associated with it that you can't seem to work around. Keeping things as simple and hard-coded as possible is really what makes the copy-this-file approach work, using the file name to derive the ID for our DAG.

Let's see if this thing came back up. That's a good sign — sweet. Yeah, we're already over 2,000, so the hard part is over: we don't have the error about the 30-second timeout, and we've actually imported almost 10,000 — about 9,100 — DAGs into the system. What's happening now is that it's taking the time to write them into the metadata database, so that number will probably keep going up until the end of the talk, until it gets to about 9,079, I think, is where we end up. Which is nice, though: the system is now usable. If a DAG has already been imported into the metadata database and you're not making changes, Airflow doesn't have to rewrite it into the database. Newly added DAGs or data sources will show up pretty much instantly; it's just that right here we're dumping 10,000 at once onto the system, and I'm doing it on my laptop. This isn't even a Kubernetes cluster in some cloud somewhere. If we only had that 23-kilowatt CPU chip, we could probably get this done instantly — but the M1 Max is pretty incredible.

So that's now working, and you now have another problem, which I'll show you. We pre-computed our way to a solution, but the new problem is: good luck running 10,000 tasks at this point, especially if you're using standard operators. It will fail miserably. Actually, I'll show you an example in the code. It's a little backwards, because I was playing with it, and these things take a long time to run and produce interesting data. The part on your right uses async operators; the part on the left is synchronous operators. Just using the standard out-of-the-box operators — if you're doing the tutorial and you want to scale this to 10,000, well, actually just to 1,000: this is 1,000 tasks right here on this line, and you can see how many hours it takes to whittle through and work on those tasks. The problem we're running into is that the system is actually mostly idle.

The example code — yeah. So this is the synchronous side, and this is the asynchronous code. I'm sorry it's reversed from how I'd want to show it — the bad problem first and the solution second — but we were so excited when we had it working asynchronously that we thought: wait a minute, what does this look like synchronous? We'd already solved the problem for the customer; let's go make it bad and see how bad it actually is. And it's pretty bad: for 1,000 DAGs, many hours to run. The demo code in here is basically just a Flask API emulating a call to something like a Databricks API to go run some transform on the data. The machine is basically completely idle, but Airflow is out of resources to do any work, so it has to wait. And we introduced — again — a fake sleep inside the mock Databricks API to simulate what that work might look like.
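The mock service is roughly this shape — a minimal sketch with hypothetical endpoint and variable names, not the repository's exact code:

```python
# docker/mock_api/app.py — sketch of the mock transform service.
import os
import time

from flask import Flask, jsonify

app = Flask(__name__)

# Tune this (or randomize it) to emulate how long your real
# Databricks-style job takes.
TASK_DURATION = float(os.environ.get("TASK_DURATION", "4.0"))


@app.route("/run-transform", methods=["POST"])
def run_transform():
    # Synchronous stand-in for remote work: a caller using a standard
    # operator pins an executor slot for this whole duration.
    time.sleep(TASK_DURATION)
    return jsonify({"status": "done", "duration_seconds": TASK_DURATION})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```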
So this may be a good test bed for those of you doing something similar, where you're using Airflow to orchestrate an external data transformation platform like Databricks instead of having Airflow itself do the transformations. Airflow is an amazing scheduler, and I love the way it works, which is why we use it this way. You can now use the demo code in this repository to tune and try things out. You can take that Flask demo worker — if you look over here in the code, it's in the Docker folder; there's the mock API — and literally change the task duration, or put in some kind of random task duration, to simulate what your work might look like, and actually work on your laptop. Instead of: write some code, push it someplace, pray; come back, write some code, push it someplace, pray. I hate that loop. I want to work on my machine, with my tools, and this is an ideal way to explore a lot of the tuning options inside Airflow and see where we can make improvements and get a lot of benefit. And one of those was moving to deferred tasks — I'll show you why.

As you see with these thousand tasks, it takes a long time before the system even starts chewing down through them. Look at the graphs: the top graph is DAGs queued — that's the number of DAGs in the queue to run. The blue line is DAGs running, and the green lines are DAG successes — how many were actually successful as we go. This is again what we gained by upgrading to the latest version of Airflow and including Grafana and Prometheus in the demo.

On the bottom graph, the green line is the important one. At the top here, there are 30 slots in the scheduler for Airflow to do work. Any time you start a DAG — click the plus button and say, I want to run this DAG — it goes and occupies a slot, and no other work can take that slot while Airflow is sitting there in that spot. Now watch what happens here: this is the bad run over here, and you can see it's basically fully occupied. The yellow is the Airflow executor running tasks, and while all the tasks are running, the green drops all the way down to zero. So whether your tasks take four seconds apiece or 30 seconds apiece, you'll have no slots available, and it all runs as slowly as it possibly can, in serial, while the tasks run.

Compare that to moving over to deferred operators with async triggers — that's what the first examples here are. The same 1,000 DAGs take — was it under an hour? Yeah, just a little under an hour to run. And if you follow the green here, you'll see it's now wavering, bouncing up and down off the bottom. That's because a deferrable operator takes whatever task you're going to run — say it's calling an external API; in this case we're calling that Flask API — kicks it off, and throws the waiting over into a separate asynchronous event loop. You now have an additional piece of infrastructure inside Airflow called the triggerer, and the triggerer polls periodically to see whether things are done — without occupying a slot.
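Here's a minimal sketch of that machinery using Airflow's deferrable-operator API — defer(), BaseTrigger, and TriggerEvent are the real interfaces; the polling here is simulated rather than hitting a real status endpoint:

```python
import asyncio
from typing import Any

from airflow.models.baseoperator import BaseOperator
from airflow.triggers.base import BaseTrigger, TriggerEvent


class MockApiTrigger(BaseTrigger):
    """Waits for the remote job without occupying an executor slot."""

    def __init__(self, job_id: str, poll_interval: float = 5.0):
        super().__init__()
        self.job_id = job_id
        self.poll_interval = poll_interval

    def serialize(self):
        # The triggerer re-creates the trigger from this classpath + kwargs.
        return (
            "plugins.mock_api.MockApiTrigger",  # hypothetical module path
            {"job_id": self.job_id, "poll_interval": self.poll_interval},
        )

    async def run(self):
        # Runs on the triggerer's event loop; thousands of these can share
        # one process because they spend nearly all their time awaiting.
        # Simulated: a real trigger would poll the job's status endpoint
        # (e.g. with an async HTTP client) instead of a fixed sleep count.
        for _ in range(3):
            await asyncio.sleep(self.poll_interval)
        yield TriggerEvent({"job_id": self.job_id, "status": "done"})


class MockApiOperator(BaseOperator):
    """Kicks off the remote job, then defers until the trigger fires."""

    def __init__(self, job_id: str, **kwargs: Any):
        super().__init__(**kwargs)
        self.job_id = job_id

    def execute(self, context):
        # A real operator would POST to the API to start the job here,
        # then immediately give up its worker slot by deferring.
        self.defer(
            trigger=MockApiTrigger(job_id=self.job_id),
            method_name="execute_complete",
        )

    def execute_complete(self, context, event=None):
        # Re-scheduled on a worker once the trigger yields its event.
        return event
```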
So more work can now be scheduled, and you can better maximize the usage of the actual CPUs in your system, because we're running lots and lots of Python processes — we can scale up the number of workers to better utilize the resources on the system itself. That's how you'd want to solve this problem: with deferrable operators. And that works with any type of operator; you just need to change the base class over to a deferrable operator. Then you need the trigger: if the operator doesn't include one, you write your own trigger that runs in that triggerer process and figures out when things are done. This really could be a whole other talk, called "Too Big for the Scheduler," but I think this should get you started and down the right path for success with Airflow. Again, it's all about speed and optimization around getting your DAGs imported as quickly as possible.

I may have time for one question. All right, we'll do one question — you're the big winner.

Q: Hello, thank you for the talk.

A: You're welcome.

Q: Quick question: have you tried pushing it even further? Because now, instead of having your DAG factory create all the DAGs and put them in globals, you put them in files and wait for the scheduler to read them all. Have you considered generating all the DAG objects in Python, pickling them, saving the binary, and then having the scheduler just read that?

A: I've not — we've not gone that far.

Q: Or maybe multi-threading? I don't know, just ideas popping up from your talk. Since you're doing crazy things anyway.

A: No, it's interesting. I don't know. Crazy things, true. I like to keep things as simple as possible until they no longer work, which is why I'd still recommend anyone start with the DAG factory approach — and then, as it breaks, you know this is the next way to go. With this one, I don't even think about those Python modules in the DAGs folder. They're just generated; they're not checked into the code repository. We only generate them in the CI pipeline when we're packaging up the container to put it in a registry and ship. So I wouldn't want to introduce any more complexity into the system unless I had to. Thank you.

Thank you all very much.