Cool. I just want to say thanks, everyone, for attending my talk, titled "Automating Airflow Backfills with Marquez." We're going to go through a few use cases on when backfills make sense, and how you can use lineage metadata in Marquez to know which downstream dependencies are affected by a data outage.

Quickly, I want to introduce myself. Hey, I'm Willy. I'm a software engineer at Astronomer, on the data lineage and observability team, where we focus on providing tools to understand where your workflows break, specifically around Airflow. I'm the co-creator of Marquez, an open source metadata service that is now part of the Linux Foundation, and I'm also a committer on OpenLineage, a standard for collecting lineage metadata from a job under execution. And if you don't mind, follow me on Twitter if you want to keep talking about metadata or anything related to it.

For this talk we're going to do a few things. We'll talk about backfills and look at them the naive way. We'll do a quick introduction to OpenLineage, then an intro to Marquez, and we'll end with backfills, take two. By "take two" I mean using the lineage API that Marquez exposes to traverse the lineage graph and know which downstream dependencies were affected by, say, your job failing. Finally, I'll go into some future work we're doing in OpenLineage, where we want to extend the model, and some of the APIs coming to Marquez as well.

Before we dive into that, I want to go through a pretty normal scenario. Before Astronomer I was at Datakin, a data lineage startup, as a founding engineer, where we developed Marquez; and before that I was at WeWork. At WeWork we had a pretty standard approach to analyzing our room bookings, so I want to walk through a quick example of how we approached room bookings, the queries our analysts executed, and how that becomes a backfilling scenario.

So, let's get booking. When you want to book a room at WeWork, you first pick the location and floor; here we're looking pretty far into the future, and we want to book something at the Salesforce location. There's a time you want to book, so where are the open time slots? You also pick a duration, and once you say, okay, this room at this time looks good for me, you confirm your booking.

Analysts have a really common question: in this case, they want to know which WeWork spaces are doing well. The question they ask is, which locations have the most room bookings? Since I'm in the Bay Area, we'll focus specifically on locations in San Francisco. What the analysts really care about is: given a set of room bookings, what are the top locations? What they usually do is write a query, in this case a really simple one, and hand it to a data engineer saying, okay, go productionize this. Here we just do a select on location with a count, reading from the room bookings table, grouping by location, and limiting to 10. So we feel pretty good.
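As a sketch, the productionized version might look roughly like this as an Airflow task. The table, column, and connection names (room_bookings, location, analytics_db) are assumptions based on the example:

    # The analyst's query wrapped in an Airflow DAG: the "go
    # productionize this" step, scheduled to run daily.
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.postgres.operators.postgres import PostgresOperator

    with DAG("top_room_bookings", start_date=datetime(2022, 1, 1),
             schedule_interval="@daily") as dag:
        top_locations = PostgresOperator(
            task_id="top_room_bookings",
            postgres_conn_id="analytics_db",  # hypothetical connection ID
            sql="""
                SELECT location, COUNT(*) AS bookings
                  FROM room_bookings
                 GROUP BY location
                 ORDER BY bookings DESC
                 LIMIT 10;
            """,
        )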
The analyst now has a dashboard that their stakeholders can use to see which locations are doing well and to forecast. And as a data engineer, I feel really good: I productionized that SQL and used Airflow to define a workflow that runs it periodically, and I can observe whether the pipeline is failing or whether there are any issues specific to the workflow.

But DAGs do fail, and backfills are a thing. What I mean is: say we view the execution of the workflow every month, and we have those data points, but we also see a sudden drop. You ask yourself: is data quality the issue, or did we simply not see a lot of room bookings during that time? There are a lot of questions you start asking, and a few key issues come up when you see DAG failures.

One is data freshness: how often is your data updated, every hour, every day? And if so, is there a missing partition for that time period? Data gaps can be the result. Another is data quality in terms of shape: datasets change, and they change often. Columns get added, columns get dropped, data types change. Any one of those scenarios could be why your pipeline is failing, or in this case why our dashboards don't look correct, or at least we assume they don't.

Also, bad code. I don't know about you, but I don't write bad code anymore, so I'd be pretty surprised if that's why the DAG is crashing. But we iterate a lot: we modify workflows and push them out regularly, and most of the time there are things our analysts want updated in the SQL, so we have to push that code out. The drop in bookings might just be the result of a recent code push, but we're not sure yet.

And the last one is bad DAG dependencies, meaning upstream failures. At your edges you have workflows doing ETL, pulling data from S3 and loading it into your warehouse, or pulling from some external source and loading it into your platform. Any of those could have failed, which results in a cascading effect downstream.

Before I go further, I want to talk about backfilling and how to define it. As your organization scales up, the amount of data and the number of internal data sources increase, and the types of data problems you're equipped to deal with, or have to deal with, scale as well. As your organization grows, different teams form and different sources get introduced: you could have Redshift, you could have Snowflake, Postgres; there are a lot of different data sources and formats. And as data outages happen, they become more disruptive. On a small team, when there's an outage you can communicate within that team. But imagine an organization of a thousand people with critical reports that need to go out every week.
If a data outage happens there, that communication becomes much more difficult. You're Slacking people, you're emailing, asking what's going on, why don't my dashboards look correct?

So, finally, we define backfilling. Backfilling refers to the task of retroactively processing historical data, and that just means filling in the gaps. Filling in the gaps means you either do full reprocessing of your data or you process a subset of it: if you have weekly data and you're missing a day, you just backfill that one day. Having a central place to analyze and understand DAG dependencies will make your organization more resilient to data outages, and that's where Marquez comes in. Marquez is a metadata store for lineage, job, and dataset metadata, and we'll get into that in later slides.

For data outages, if we break down where the issue gets identified, 90% of the time it's downstream of the issue. That could be a customer who has a dashboard in their app and now sees incorrect numbers, or a discrepancy between what they saw last week and what they see now. 90% of the time, that's where issues surface, which is not a good experience. The other 10% is code-based detection: reviewing code, you're able to catch "wait, that SQL looks wrong" or "that code needs a test," and that's where manual testing comes in as well. So only a very small fraction is detected before it makes it out to the end user. One example: you push code out to production, and a week later your analyst comes to you asking what's going on, all my dashboards are incorrect. On the right of the slide, you see the result: days or weeks pass before incidents are detected and resolved. Then you have to ask yourself, when was that bug introduced? When did we start seeing issues in our data? When did the data quality drop?

Airflow does provide a pretty simple way to do backfills: there's a backfill command where you give a start date, an end date, and a DAG ID, so you can run backfills at any time just using the CLI. One important thing before I go deeper: the execution date is critical here, and it confused me early on. If your workflow runs every 24 hours, there's a 24-hour window, and those 24 hours have to pass before the run for that execution date gets triggered. I got confused by that when I was initially working with Airflow and running backfills, so I thought it was important to point out.

What I'm going to do now is walk through the different scenarios that happen. You have your job and your output dataset, and in this case we have a data quality issue with our input dataset. Think of a workflow that's pretty critical, like billing and payments: if your input dataset is missing data or has duplicate rows, you're overbilling customers, and that's a pretty big issue. There is tooling out there today for creating assertions on the input dataset, and one of those tools is Great Expectations.
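As a rough sketch of what such an assertion can look like, using Great Expectations' Pandas API (return types vary a bit by version; the file and column names are assumptions from the room-bookings example):

    # Validate the input dataset before the job consumes it. If the
    # assertion fails, stop the pipeline instead of overbilling anyone.
    import great_expectations as ge

    bookings = ge.read_csv("room_bookings.csv")  # hypothetical input extract
    result = bookings.expect_column_values_to_be_not_null("location")
    assert result.success, "room_bookings contains NULL locations"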
Does anyone here use Great Expectations? Okay. It's one of the integrations we have with OpenLineage, and it speaks to the different ways to collect assertions about your datasets.

One way you could try to fix this is to just retry. If your job is idempotent, you can retry it right away: maybe the data is there now. But that only works for so much, or it only solves the problem temporarily, for example if the data wasn't complete before but is now. Retrying has its own issues, and as I mentioned, if your code is not idempotent you're going to end up with duplicate rows in your output dataset. So you start looking and debugging, and you realize, oh wait, there's an upstream issue. The only thing you see as a data engineer or developer is the input dataset; that's the interface you work with. You know the schema, you know the shape, you know where it's stored, but you don't necessarily know who owns that dataset. If you had a lineage graph and could understand the dependencies, you could look upstream and realize, oh wait, there's a job failing up there. I don't need to dig into this anymore; it's not my code, it's something going on upstream.

The other case is one bad data point, in this case one bad partition. When you read datasets, it's usually not the full dataset; it's partitioned by the hour or by the day. A job could be failing because one partition is now missing, or because a partition has grown significantly and your job just can't process it. Your job starts failing, and the output dataset is either missing data or arriving late, and you can imagine consumers downstream of your output dataset now also having issues. You have this cascading problem.

And for those of you still writing bad code: you push code out that makes your job fail, and even though there's nothing wrong with the input dataset this time, your job is failing as the result of that recent push. Your output dataset is again delayed or has gaps, and as I mentioned before, anything downstream that depends on your output dataset now has issues too. That results in dashboards that are critical to your organization, especially around growth or churn, no longer being available to your CEO, and that is not a fun meeting to have.

So backfilling is tough. What I mean by that is: how quickly can we detect data quality issues, explore and identify them, and eventually resolve them? Currently there's not a lot of tooling out there for this. Think about microservices: when you're deploying software there's a lot of tooling and metrics you can look at, especially around downtime, and when you expose APIs there's so much monitoring around them. For jobs, the interface is really the dataset. Any time your workflow produces a dataset, that becomes your interface; that's how other workflows interact with the artifact you output. As for what alerting rules you should have in place to identify that there's an issue: a lot of the time, and I know Uber did this along with a few other organizations I know of, teams just run SQL checks against their tables. The checks run every hour, and if they find more nulls than they expect, they trigger PagerDuty or send a message in Slack so people get notified. That's a common way to do it.
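A minimal sketch of that hourly-check pattern, assuming a Postgres table with a created_at column; the threshold and the notify_on_call helper are hypothetical stand-ins for your own alerting integration:

    # Count NULLs in the last hour of data and alert if the rate is
    # above a threshold.
    import psycopg2

    def notify_on_call(message: str) -> None:
        # Stand-in for your PagerDuty or Slack integration.
        print("ALERT:", message)

    conn = psycopg2.connect("dbname=analytics")  # hypothetical DSN

    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT COUNT(*) FILTER (WHERE location IS NULL), COUNT(*)
              FROM room_bookings
             WHERE created_at >= NOW() - INTERVAL '1 hour'
        """)
        nulls, total = cur.fetchone()

    if total and nulls / total > 0.01:  # more NULLs than we expect
        notify_on_call(f"room_bookings NULL rate: {nulls}/{total}")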
But being able to get notified when the issue begins to happen will help you trace and debug the data problem a lot sooner, and it makes everyone's lives a lot better going forward.

So what effects, if any, do upstream DAGs have on downstream DAGs? We talked about this: delayed consumption can have cascading effects on dashboards. And if you're doing stream processing, if you have a lot of data coming in and you haven't scaled up your workers to process that streaming data, and you're no longer archiving to S3, that causes issues all around, especially if there are workflows that depend on those objects being in S3 so they can be loaded into your warehouse.

All right, so why are backfills this hard? They shouldn't be. When I was at Datakin, the one thing we realized was that there needed to be a standard; I mentioned OpenLineage earlier. There needed to be a standard for collecting lineage metadata from your workflows: to understand, when your job fails, what the run args were, the parameters passed into your workflow, so you can see which workflows using a given parameter are all failing because of it; to understand the inputs and outputs and the schema of your datasets, and when changes happen; and to version your code. It sounds like a lot, but that's what the standard is attempting to do.

If we take a step back and look at the current landscape (data analyst tools, schedulers, warehouses, SQL engines), if you want to pull metadata from those engines, you have to write custom code. That's what we did for Marquez, Amundsen does the same, and a few others as well. Where OpenLineage comes in is providing that layer that lets you say: give me lineage metadata from the warehouse. The spec itself lets you look at an event and understand exactly what query was executed, and by what job. That way we crowdsource it: we don't have to worry about a new Airflow release coming out and then doing a lot of work on our end to fix our integration for that particular version. We now have a standard we can all collectively contribute to, which makes our lives a lot easier when it comes to processing lineage.

And now I want to get into Marquez. Marquez is a project that came out of WeWork that I co-created. It was part of our data platform and collected metadata for all of our workflows, especially Airflow, and we were working on streaming as well. It's now an LF AI & Data project that's incubating. We didn't read this paper until after we started working on Marquez, but there's a really good paper out of Berkeley called "Ground: A Data Context Service." It describes an API and a model for storing metadata, and more specifically for versioning, which was the really interesting part, so I recommend giving it a read if you want more background on metadata and versioning.

Quickly, here's a high-level view of Marquez: data lineage, data governance, and data discovery. Marquez exposes a lineage API, and for data governance there are things like tagging.
And for data discovery, there's a search API so you can search your datasets and jobs. Marquez is a metadata service: it stores objects for sources, datasets, and jobs, and there's a core API that lets you do data governance. By that I mean understanding how your data flows through your platform, and, if you tag datasets as containing PII, knowing which jobs read from those datasets.

For the data model: if you look at the OpenLineage model, there are three core components, jobs, datasets, and runs. I talked about the Ground paper; where Marquez really comes in is the versioning. It versions your dataset when the schema changes, and also your job, so when your code changes you can know which run was based on which job version, which run produced which dataset version, and what the input dataset versions were for that run. There are different source types supported, like MySQL, Postgres, Redshift, and many others. And jobs aren't just batch; there's streaming, and we're looking into supporting services as well, because operational data stores are pretty critical.

The design benefit, especially around backfilling, is knowing which job versions produced and consumed which dataset versions. You have this multi-dimensional model where you know the input datasets and their versions, the job versions, and the output dataset versions, so you can run full or incremental processing when you want to backfill. Imagine you did push that code out, and a week later you realize, oh wait, now we have issues in our dashboards. You can go back, query the API, and know exactly what version of the code went out on that particular date. You can debug by diffing the two code versions, and also see which downstream jobs were reading dataset versions that are now corrupt and need to be reprocessed.

Marquez uses a push-based metadata collection model: you push OpenLineage events through a REST API to the Marquez backend, and we process those events and store them in our model.
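As a rough sketch of what that push looks like with the openlineage-python client (the endpoint, namespace, job name, and producer URI here are illustrative):

    # Emit a START and then a COMPLETE run event to the Marquez backend.
    # In a real workflow, the job's work happens between the two events.
    from datetime import datetime, timezone
    from uuid import uuid4

    from openlineage.client import OpenLineageClient
    from openlineage.client.run import Job, Run, RunEvent, RunState

    client = OpenLineageClient(url="http://localhost:8080")  # your Marquez API
    job = Job(namespace="analytics", name="top_room_bookings")
    run = Run(runId=str(uuid4()))
    producer = "https://example.com/room-bookings-pipeline"  # identifies the emitter

    for state in (RunState.START, RunState.COMPLETE):
        client.emit(RunEvent(
            eventType=state,
            eventTime=datetime.now(timezone.utc).isoformat(),
            run=run, job=job, producer=producer,
        ))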
There are different integrations we support: Airflow, Spark, dbt; we're working on Flink and Iceberg as well, and there are many other integrations we're focusing on. In a nutshell, when your workflow runs, especially in Airflow, you push all that metadata to the Marquez backend, and Marquez begins to populate the model. Here you have a workflow with the OpenLineage library installed; it connects to the Marquez backend and transports that metadata to the Marquez REST API.

We're going to go over the OpenLineage Airflow integration, because it's really important in the context of backfilling. What we do is collect metadata around the task lifecycle: the run args, the run parameters. We automatically pull in the link to the code, as well as inputs and outputs; we do some SQL parsing, so lineage tracking is built in for DAGs. We actually have a new SQL parser, recently released, that's written in Rust, so it's a lot more performant, a bit more reliable, and it's performing pretty well in production. The link to code is built in too: you can know at what point your code changed, because when you push new DAGs, Airflow pulls them down, and if you store them in Git you'll have that Git SHA. The library is open source as part of the OpenLineage repo. You used to have to modify your imports, but as of a recent Airflow release, which I think is 2.3, you no longer have to modify your DAGs at all: if you do a pip install, it automatically begins extracting metadata from all your DAGs.

There's an operator-to-extractor model, and I want to show it really quick. Airflow has a PostgresOperator, and OpenLineage has this extractor model that pulls metadata from it. On the top you have Airflow, and at the bottom you have OpenLineage, where extraction happens: it builds the OpenLineage event and sends it to the Marquez backend. If we parse this and walk through it: first you look at the source, in this case the PostgresOperator we're processing; you do SQL parsing, and that becomes your dataset; and the task ID becomes your job (there's a naming convention we have with Airflow there). Those are the sequence of events that happen.

What we're really working towards is this: if you have two workflows that don't know they depend on each other, there's an underlying implicit dependency. A new_room_bookings job inserts into the room_bookings table, and this top_room_bookings DAG processes that table. Being able to stitch that together through a lineage graph is super important, especially if you're running multiple Airflow instances with different DAGs on each: how do you end up linking them together? Being able to look at it holistically through a lineage graph will help you understand the dependencies.
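All of the above assumes the integration is installed and pointed at Marquez; as a rough sketch, and with the caveat that exact setup varies by Airflow version, that wiring looks like:

    pip install openlineage-airflow
    export OPENLINEAGE_URL=http://localhost:8080   # where Marquez listens (assumed)
    export OPENLINEAGE_NAMESPACE=analytics         # namespace to group metadata under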
So what I want to do now is a take two on querying the lineage metadata in Marquez. In step one, you connect to Marquez; here we're just using localhost:8080. Then we build a node ID, and there's a convention for node IDs in the lineage graph: it starts with "job" or "dataset", delimited by colons, then the namespace, here "analytics". A namespace is our way to contextualize your metadata: analytics could be a team, you could have engineering, it's just a way to group your data. The last component is the job name; here we have top_room_bookings. If you take that node ID, use the client, and call get-lineage with it, what you get back is all the dependencies: starting at that node, everything upstream and downstream of it. At the bottom it's printed out.

What I'm going to do now is quickly query that same Marquez REST API from Python. If you hit /api/v1/lineage and provide a node ID, what you get back is a graph. Here we have a job under the food-delivery example, just a different example I seeded Marquez with, called example.delivery_7_days. Looking at the response, the graph comes back as an array of nodes: each has the ID of the node, its type, and a data blob, so depending on whether it's a job or a dataset you get different metadata back. Here it's a dataset, so you get the dataset's fields. If I scroll down, the important parts are the in and out edges. Since Marquez keeps the dependencies between upstream and downstream, you can look at the in edges or the out edges and follow them downstream.

If we go back (here I just printed out the lineage object), this is really all you need if you want to know which jobs, which Airflow DAGs, to rerun or backfill. We do the same thing: connect with the Marquez client, and then we have this backfill_downstream function, which is just recursive. Say top_room_bookings was failing: you start at that node and follow the graph all the way down. We use the client to get the node, and for each out edge we check whether or not it's a job. If it's a job, that means we want to run a backfill, so with a start date and end date (which could be parameterized; I don't have specific values here) you run airflow backfill with those dates and the ID from the out edge, because we know it's a job that has to be rerun. Then you call the function again: at the bottom you see backfill_downstream called on the out edge's destination, so you just keep following the graph until you hit the end.
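Putting that together, here's a minimal sketch of the recursive walk, assuming the Marquez lineage API at localhost:8080 and shelling out to the Airflow CLI; the field names follow the payload described above, and cycle and error handling are omitted:

    # Walk the lineage graph downstream from a node, backfilling every
    # job we reach via the Airflow CLI.
    import subprocess
    import requests

    MARQUEZ_URL = "http://localhost:8080"

    def get_node(node_id: str) -> dict:
        # Fetch the lineage graph around node_id and pull out that node.
        resp = requests.get(f"{MARQUEZ_URL}/api/v1/lineage",
                            params={"nodeId": node_id})
        resp.raise_for_status()
        return next(n for n in resp.json()["graph"] if n["id"] == node_id)

    def backfill_downstream(node_id: str, start_date: str, end_date: str) -> None:
        for edge in get_node(node_id)["outEdges"]:
            dest = edge["destination"]        # e.g. "job:analytics:top_room_bookings"
            if dest.startswith("job:"):
                dag_id = dest.split(":")[-1]  # last component is the job (DAG) name
                subprocess.run(
                    ["airflow", "dags", "backfill",  # "airflow backfill" on 1.x
                     "--start-date", start_date, "--end-date", end_date, dag_id],
                    check=True)
            backfill_downstream(dest, start_date, end_date)  # keep following the graph

    backfill_downstream("job:analytics:top_room_bookings",
                        "2022-01-01", "2022-01-07")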
What I wanted to show really quick is that this API ends up feeding the Marquez UI. Here you have a complete lineage view of the example I seeded into the local Marquez instance, and you can traverse it and look at the different metadata for each dataset. We talked about versioning: you can look at the different versions that were created, whether because the schema changed or because a run wrote to the dataset (each write is viewed as a different version), and there's a link to the run that created each version. The squares are datasets, and the circles are jobs; here's a job that just ran a SELECT *, and you can look at its run history as well, the runs that have happened. If any of these jobs fail, you can visually look at the lineage graph and say: okay, there's a job downstream depending on my output dataset. Rather than that team reaching out to me saying "my job is failing and the dataset needs to be repopulated," you could automatically do it for them. Automating the backfill for downstream jobs removes a lot of the friction between teams, if they're okay with that; sometimes they want full control.

And I'll end with failing collaboratively: being able to have a global view, and understanding that you don't want teams to remain isolated. As teams run into failures, instead of failing by themselves, they should be able to learn from each other's failures, depend on a lineage graph to communicate when things fail, and understand the overall health of their data ecosystem. You also want to coordinate efforts: if there's a data outage and someone is working on the issue, someone else could be working on the same thing, and you want to combine efforts. If someone's really good on the infrastructure side and they're trying to figure out a data quality issue, without a global view of how things depend on each other you might be duplicating effort. And you want to empower teams: if they identified the problem through the lineage graph but don't have the tools to solve it, they can't resolve it, and that kind of sucks. Empowering teams to solve these types of problems is super important, and one way to do that is to build on top of the lineage graph and do a lot more automation.

I did write a blog post on this, so if you want to take a look at the script, feel free; it's on the OpenLineage blog and it's called "Backfilling Airflow DAGs using Marquez." At the time I hadn't thought much about automation, but it goes a bit deeper and covers my talk as well.

And finally, some future work. Column-level lineage is coming; I think it's already supported in OpenLineage, but Marquez needs some work to support column-level lineage. There's job hierarchy and grouping: when you return lineage metadata, if an Airflow DAG has a bunch of tasks, how do we group them so that displaying them in the UI is a lot easier? And there's Flink integration and Iceberg support coming as well, so if you're using those tools or frameworks, I'd love to talk to you after.

I just want to say thank you for attending my talk; hopefully you got something out of it, at least maybe some code to automate some of your workflows. I'll end with this: check out OpenLineage, follow us on Twitter, and the same with Marquez. And with that, any questions? Okay. Thank you.