Hi, I am Sakshi, and I am going to talk about how you can improve your data quality using Apache Airflow check operators. First, to set the context for those of you who are not familiar with Airflow, here is a short introduction. Apache Airflow is a tool for ETL orchestration. The structure that you see here in the Airflow UI is called a DAG, a directed acyclic graph. The operators that you see here are what actually get the work done; the jobs run in them. So while the DAG describes how you run a workflow, the operators determine what actually gets done. The arrows determine the dependencies: to the left of an arrow is the parent operator, to the right are the child operators, and only after the parent operators are done do the child operators start.

To give you an idea of the scale at Qubole, this is a typical DAG at Qubole, one of our most involved ones. As you can see, it has 90-plus operators of 8 to 9 different types. If anything goes wrong in any of them, we might end up with data corruption, and the biggest problem is that it is very hard to debug what went wrong and where. So this is very error-prone.

That brings me to the data quality issues we ran into. We had missing data: data wouldn't reach us, or a bug in the pipeline would cause data to go missing. We had data duplication issues: a query might not have been written to be idempotent, so we might be processing the same data again and again. We had data corruption issues, and some system issues due to which the ETL itself did not run.

This brings me to the importance of data validation. An application's correctness depends on the correctness of its data: if your data itself is not correct, then the application is not reliable. This means we have to increase the confidence in our data by being able to quantify data quality, so that we have some quantifiable confidence in our application. Correcting existing data after the fact is very expensive and a very cumbersome process; those of you who have done it will know this. And most critically, downstream tasks suffer if the data is invalid.

Our first attempt at solving the data quality problem was trend monitoring. The idea was that we would monitor the data and alert on any anomaly, like suddenly receiving a burst of data, or no data arriving for a particular day. Those were the kinds of trends we alerted on. But we soon concluded that this is a hard problem. First of all, it's not real time: these trends can only be monitored once the ETL has finished, once we have the data, so we cannot stop the pipeline if something is wrong. Also, one size doesn't fit all. Different ETLs manipulate data in different ways: they have different critical fields and different data types; some tables are partitioned, some are not; some add new data daily, while others add little new data and mostly edit some columns. So generalizing trends is difficult, and the system becomes difficult to maintain.
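To make the DAG, operator, and dependency vocabulary from the introduction concrete, here is a minimal, hypothetical Airflow DAG. It is not from the talk; the DAG ID, task IDs, and commands are made up, and it assumes a recent Airflow 2.x install:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# The DAG describes *how* the workflow runs: its schedule and start date.
with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
):
    # Operators determine *what* actually gets done.
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    transform = BashOperator(task_id="transform", bash_command="echo transform")
    load = BashOperator(task_id="load", bash_command="echo load")

    # The arrows in the UI correspond to these dependencies: a child task
    # starts only after its parent has finished.
    extract >> transform >> load
```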
So whoever writes the ETL also has to understand the entire structure of the trend monitoring system, the entire architecture of another project, keep the two in sync, and work on multiple projects at the same time. That is very difficult. It is also not foolproof: there can be a lot of false positives, or we might miss some alerts.

We noticed that if we were to delegate the task of data validation to the ETL itself, then all three problems I've mentioned (it not being real time, the difficulty of generalizing trends, and the difficulty of maintaining multiple projects) could be solved. The validation would be specific to an ETL; it would be real time, because it would run along with the ETL itself; and it would be easy to maintain, because the ETL writer doesn't have to work on multiple projects and only has to write the validation for their own ETL.

This is when we came up with the idea of using assert queries for data validation. Just as we have assert statements in a unit testing framework, we can have assert queries for data validation. While this is simple, it has generated immense value for us. While exploring the implementation, we came across Airflow's check operators. Let me talk a bit about them. These operators expect a SQL query and a pass value: if the output of the SQL is not within some predefined range of the pass value, the operator fails.

So this has been our approach: we extend the open source Apache Airflow check operators for queries running on the Qubole platform, we run data validation queries, and we fail the operator if the validation fails. After that, we have the flexibility of failing the entire DAG if the failed check is critical enough, or continuing along the path if the check is not that critical. We can tweak this using something called trigger rules in Airflow: if need be, the subsequent tasks can begin to run even if the parent operator has failed.

Let me digress a bit to talk about Qubole Data Service, because we need to know what Qubole commands are from here on. It is a self-service platform for big data analytics: we provide Apache tools like Hadoop, Hive, and Spark integrated into one platform, and you can submit your queries to these engines. So when I talk about Qubole commands, these are essentially queries running against those data engines.

Coming back to the check operators: this is how you create a Qubole check operator. The command type here is a Hive command, so the query is going to run against the Hive engine, and its output is going to be matched against the pass value. If it doesn't match, or isn't within some predefined range, the operator fails.

Now, when we started using these operators, there were some limitations we encountered, and some enhancements we had to make for our use case. For example, we wanted to compare data across multiple engines, but the problem was that the pass value needed to be defined before the ETL starts. This use case was pretty important for us, because we were importing data from our Amazon RDS instances into our data warehouse, and there was no way of knowing beforehand what the expected value in the warehouse tables would be.
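As a sketch of the operator just described, with exactly the limitation just mentioned (a pass value hard-coded before the ETL starts): the import path is the Airflow 1.10 contrib package where the Qubole operators lived, and the table name, connection ID, and numbers are hypothetical:

```python
from airflow.contrib.operators.qubole_check_operator import QuboleValueCheckOperator

# Created inside the ETL's DAG definition (e.g. a `with DAG(...):` block).
# Runs the SQL on Qubole's Hive engine and compares the scalar result
# against pass_value; the task fails if it falls outside the tolerance.
row_count_check = QuboleValueCheckOperator(
    task_id="row_count_check",
    command_type="hivecmd",            # run the query against Hive
    query="SELECT COUNT(*) FROM warehouse.events WHERE dt = '{{ ds }}'",
    pass_value=1_000_000,              # expected value, fixed up front
    tolerance=0.05,                    # pass if within +/- 5% of pass_value
    qubole_conn_id="qubole_default",
)

# A non-critical downstream task can keep running even when the check
# fails, via Airflow trigger rules, e.g.:
#   cleanup = BashOperator(..., trigger_rule="all_done")
```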
This means that if you are forced to hard-code the pass value at the start of the ETL, it is not possible to fetch the value from the source engine and compare. The solution was to make the pass value an Airflow template field, so that it can be configured at runtime. The pass value can then be injected through multiple mechanisms, one of them being getting it from another operator. So the approach is: we run a query against our source table, we get the pass value, and we inject it into the check operator running on the destination table. This is how we can compare data in two tables at runtime.

This is how you define an Airflow template field. The pass value that you see here is a Jinja template. This is a very simple example just for the demo: "1 - 1" here would be evaluated at runtime. What we actually do is get this value from another operator at runtime, and that is how we're able to compare across engines.

The second enhancement is that we wanted to validate multi-row results. Currently the operator considers only a single row for the comparison, the first row. But for our use case we wanted to run GROUP BY queries and compare each of the aggregated values against a pass value; essentially, we wanted to compare against the second column, not just the first row. So to the Qubole check operator we've added a parameter called results_parser_callable. The function pointed to by this parameter holds the logic that returns the list of records on which the checks are performed. So which rows to use for comparison, which column to use, and how to parse the result is up to us. This is how you parse a query result: here we're returning the values of the second column, against which the check is then performed. I'll show a combined sketch of these two enhancements at the end of this section.

Now, how have we integrated these operators with our ETLs? The first ETL is data ingestion; I've already talked a bit about this. Here we're importing data from RDS into our data warehouse for analysis purposes. It's not a simple plain copy of RDS that we're importing, because we've added some optimizations: we don't want to import entire tables daily, so we've added some filtering logic, some upsert logic in Hive, and things like that. Historically we faced issues like mismatches with the source data, duplication, and data missing for certain durations due to some bug. The check we've employed here is a count comparison across the two data stores, source and destination, using the approach I mentioned before. These checks have helped us verify and rectify our upsert logic.

The second is data transformation. This ETL repartitions a day's worth of data into hourly partitions, so at the end of a day we should essentially have 24 hourly partitions. Historically we faced issues like data ending up in a single partition, fields having the wrong ordering of values, and data getting corrupted. The checks we've employed now: after a day's worth of runs, we check that the number of partitions created is indeed 24, and we check the value of a critical field, the source. This pipeline repartitions YARN data, so the source here should only be "yarn" and nothing else; if there's anything else, we know that something is amiss.
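Here is that combined sketch of the two enhancements: a templated pass_value pulled from an upstream task's XCom, and a results_parser_callable selecting which column to check. The task IDs, table names, and tab-delimited parsing are my assumptions, not details from the talk:

```python
from airflow.contrib.operators.qubole_check_operator import QuboleValueCheckOperator

def second_column(results):
    # Hypothetical parser: treat each result row as a tab-delimited record
    # and keep only the second column for the comparison.
    return [row.split("\t")[1] for row in results]

# Created inside the ETL's DAG definition (e.g. a `with DAG(...):` block).
dest_count_check = QuboleValueCheckOperator(
    task_id="dest_count_check",
    command_type="hivecmd",
    query="SELECT dt, COUNT(*) FROM warehouse.events GROUP BY dt",
    # pass_value is a template field, so it is resolved at runtime; here it
    # is injected from the XCom of a (hypothetical) upstream task that
    # counted rows in the RDS source table.
    pass_value="{{ task_instance.xcom_pull(task_ids='source_count') }}",
    results_parser_callable=second_column,
    qubole_conn_id="qubole_default",
)
```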
These checks have helped us rectify our repartitioning logic.

The third is the most critical ETL we have at Qubole. Here we're doing cost computation for our customers, which is measured in terms of the Qubole Compute Unit Hour (QCUH). Currently we're narrowing the granularity of the computation from daily to hourly: we're going to tell customers, hour by hour, what their cost is. Checks here have helped us monitor the new data and alert in case there is any mismatch with the old data, because here the source of truth is the old data.

The last ETL I'm going to talk about parses customer queries, the Qubole commands, and outputs table usage information: which tables are used frequently, which columns are used (so we can recommend columns while customers write queries), which tables are used so rarely that the customer probably isn't using them anymore, and things like that. Historically we faced issues like data missing for a certain customer account, and data loss across different engines and different versions of those engines. Our customers operate on different engines, and even on a single engine they work with multiple versions of it, so we lost data by not being able to parse such queries, by failing the syntax check on them. The checks we've employed here: we do a GROUP BY on account IDs, and if the count for any account is zero, we raise an alert; that means we're not getting data for that account, so something is amiss. Second, we do a GROUP BY on engine type and account ID, and if the error percentage is too high anywhere, we raise an alert. These checks have helped us gain insight into the amount of data loss we were facing from not being able to parse queries. They have also acted as a feedback mechanism: since we know which engine's data we're missing, we're able to incorporate that into our parser and make it more robust. A sketch of the per-account check follows.
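A sketch of the per-account zero check just described. The plain QuboleCheckOperator (the non-value variant in Airflow contrib) fails when any value in the result row is falsy, so returning the minimum per-account count makes a zero fail the task. The table and column names are hypothetical, and in practice you would also join against the expected account list, since an account with no rows at all produces no group:

```python
from airflow.contrib.operators.qubole_check_operator import QuboleCheckOperator

# Created inside the ETL's DAG definition (e.g. a `with DAG(...):` block).
# Fails if some account had no successfully parsed commands for the day,
# i.e. if the minimum per-account parsed count is 0 (falsy).
account_data_check = QuboleCheckOperator(
    task_id="account_data_check",
    command_type="hivecmd",
    query="""
        SELECT MIN(parsed_cnt)
        FROM (SELECT account_id,
                     SUM(CASE WHEN parse_failed = 0 THEN 1 ELSE 0 END)
                         AS parsed_cnt
              FROM usage.command_log
              WHERE dt = '{{ ds }}'
              GROUP BY account_id) per_account
    """,
    qubole_conn_id="qubole_default",
)
```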
Features of this framework: we're able to plug in different alerting mechanisms, so we can have an email operator, a Slack operator, or whatever kind of alert you want. We have dependency management and failure handling: if a check is critical enough, we stop the pipeline, we stop the ETL then and there; if not, we continue along the path. We're able to parse the output of the assert query in a user-defined manner. We have runtime fetching of the pass values against which the comparison is made. And we're able to generate failure and success reports, so this need not only be an alerting mechanism; we can use it for reporting as well.

The lessons we learned: first, estimating data change is a difficult problem and needs to be tackled with precision; second, delegating the task of validation to the ETL itself can solve a few issues for us. The source code has been contributed back to open source. Any questions?

Just an announcement before the questions: our next talk is by Priyanka, who's going to be talking about user response protection, so do hang around for that. Let's take the questions quickly.

Audience: Thank you for the talk. I have a question: how do you differentiate between critical and not-so-critical validation problems? And when a problem is non-critical, do you use it for analysis, or for health checks?

Audience: And another question: when there are issues after a day's worth of runs, and you only then find out that something went wrong, what kind of database do you use, and is there a possibility to roll back? Because further days will already be coming into your pipeline, so how do you handle the old days and the new days all getting mixed together?

Sakshi: OK, I'll answer the first one. As I mentioned, a check is critical if it guards something we cannot revert. For example, we compute QCUH and push it to an external system called Zuora, and we cannot take it back from there, so it's necessary that if anything is wrong, we stop then and there. That's one example of a critical ETL. The last one I talked about, where we're using the checks essentially for analysis purposes, for making our parser more robust, is one case where it's not critical. Does that answer your first question? OK. The second one, can you repeat it?

Audience: My second question was: how do you roll back? When you have a day's worth of runs and you find some issues, the next day's data may already be in the pipeline.

Sakshi: We use non-critical checks for the transitioning phase. Until now something may have been wrong without us knowing; once we added the check operators, we at least know that something is wrong. During this transitioning phase we want to analyze the results of the operator, because it's possible that our checks were too strict. After that analysis, if we see that something really is wrong, we can start stopping the ETL.

Audience: Hello, thank you for the crisp and useful talk. Do you have any recommendations on how to do reporting? If you have a lot of issues, how do you go back over them? You can't just put all of it in a Slack channel, right? So how do you do reporting? Are there charts? Are there grades?

Sakshi: We have on-success and on-failure callbacks on all the operators in Airflow, and you can add them to a task instance. We can do an XCom push and put the result somewhere in the DB, and after that we can have an email operator, or an alerting one, collect all those results from there and give them to the user. This is one thing we were planning on doing.

Audience: Hi. Where have you defined your validation rules? And are they configurable? How easy or difficult is it to change them? You define that you need to check something, put some condition on it, and based on that you assess whether the data is right, correct? So where are all these rules defined, and how easy is it to change them?

Sakshi: These are essentially defined in the validation query itself, the query that we're running for the validation. Some part also has to live in the ETL, like the parser logic I mentioned: if we're not able to parse a query, we output that information into the table itself, and then we do a GROUP BY in the validation query and extract all those errors. We use them for our reporting and alerting. That's how we use them.
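To round off the reporting answer above, a minimal sketch of the callback-plus-XCom idea, assuming the callback is attached via Airflow's standard on_success_callback/on_failure_callback operator arguments; the XCom key and report format are made up:

```python
def record_check_result(context):
    # Airflow passes the task's context dict to callbacks; push a small
    # report entry to XCom so a downstream reporting task can collect it.
    ti = context["task_instance"]
    ti.xcom_push(key="check_report", value=f"{ti.task_id}: finished")

# Attached on the check operator, e.g.:
#   QuboleValueCheckOperator(..., on_success_callback=record_check_result,
#                            on_failure_callback=record_check_result)
# A final reporting task can xcom_pull(key="check_report") across task IDs
# and send a digest through an email or Slack operator.
```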