Hi, everyone. Let me introduce myself first. My name is Masahiro Matsui. I'm a software engineer at Japan Data Science Consortium and a master's course student at the University of Tokyo. Today, I'm here to talk to you about AlphaSQL, open source software providing dependency resolution and validation for SQL and data.
I'd like to share an outline of what I'm planning to cover today. This talk will be divided into two parts. The first part is the problems of SQL and data I wanted to solve using AlphaSQL. The second part is the introduction of AlphaSQL, which can be used for resolving dependencies between plain SQL and data, visualizing dependencies as a directed acyclic graph, parallelizing SQL queries, and validating SQL and data. By the end of this talk, you'll be familiar with the problems of SQL and data and will be able to easily solve those problems using AlphaSQL.
So, as the first point, I would like to talk about the dependency hell of SQL and data. Dependency hell is a term traditionally used for the frustration of software users who have struggled with managing many dependent packages. However, the dependency hell of SQL and data has also become common in recent years. This is caused by complex data processing and processing for machine learning. For example, we can implement machine learning models using Apache Hivemall, BigQuery ML, Redshift ML, and Presto ML. And these fantastic solutions can generate more and more SQL queries and data.
The typical faces of the dependency hell of SQL and data are complex dependency resolution by hand, changing schemas of SQL queries and data, and cyclic dependencies. Let's see some examples. As a visual example, our project has a DAG like this. It would be very difficult to manage these dependencies by hand.
Even a simple example like this is actually more difficult than it seems, because the dependency graph contains not only the SQL queries but also the tables generated by those queries. Also, things get worse with user-defined functions. As you know, some processes are referenced in multiple places and should be defined as functions to prevent duplicated code. This means that the dependency graph will also contain functions and can become more complex.
These examples explain the problems. Complex dependencies lead to unwanted costs for software engineers and to missing and redundant dependencies caused by human error. Also, changing a table schema leads to dynamic type and schema errors in the dependency graph, such as missing resources (tables, columns, and functions) and type mismatches, for example between string and int. Changing a table schema is more complex because, in some cases, a change in an upstream SQL query can lead to errors in downstream SQL queries. Some people might be surprised, but dependency cycles can happen in a complex dependency graph. You can see in this graph that referencing a table, as shown by the blue arrow, can lead to a dependency cycle.
To solve all these problems, I developed AlphaSQL. The overall architecture of AlphaSQL is like this diagram. First, SQL files are parsed and analyzed by the AlphaDAG component. The AlphaDAG component resolves dependencies between statements that create tables and statements that reference tables by using static analysis techniques. Complex logic can be written like this. In the same way, dependencies between statements that create user-defined functions and statements that call the functions are resolved. Based on the analyzed dependencies, the AlphaDAG component generates a list of external tables referenced by the SQL files and a DAG in DOT language format. DOT is a language for representing graphs.
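As a rough illustration of this kind of dependency resolution, here is a minimal Python sketch. It is not AlphaSQL's implementation: AlphaSQL analyzes SQL with a real parser, while this sketch uses naive regular expressions as a stand-in. The file names, table names, and the helper functions `resolve` and `to_dot` are all hypothetical.

```python
import re

# Hypothetical SQL files; names and contents are made up for illustration.
SQL_FILES = {
    "create_users.sql": "CREATE TABLE users AS SELECT id, name FROM raw_users",
    "create_orders.sql": "CREATE TABLE orders AS SELECT id, user_id FROM raw_orders",
    "create_report.sql": (
        "CREATE TABLE report AS "
        "SELECT u.name FROM users AS u JOIN orders AS o ON u.id = o.user_id"
    ),
}

# Naive patterns standing in for real static analysis (assumption: simple SQL only).
CREATE_RE = re.compile(r"\bCREATE\s+TABLE\s+(\w+)", re.IGNORECASE)
REF_RE = re.compile(r"\b(?:FROM|JOIN)\s+(\w+)", re.IGNORECASE)


def resolve(sql_files):
    """Return (edges, external_tables) for a {filename: sql} mapping."""
    created = {}     # table -> file that creates it
    referenced = {}  # file -> set of tables it reads
    for name, sql in sql_files.items():
        for table in CREATE_RE.findall(sql):
            created[table] = name
        referenced[name] = set(REF_RE.findall(sql))

    edges = set()
    external = set()
    for table, name in created.items():
        edges.add((name, table))          # query -> table it creates
    for name, tables in referenced.items():
        for table in tables:
            if table in created:
                edges.add((table, name))  # table -> query that reads it
            else:
                external.add(table)       # not created by any file
    return sorted(edges), sorted(external)


def to_dot(edges):
    """Render the edges as a DOT digraph."""
    lines = ["digraph G {"]
    lines += [f'  "{src}" -> "{dst}";' for src, dst in edges]
    lines.append("}")
    return "\n".join(lines)


edges, external = resolve(SQL_FILES)
print(to_dot(edges))
print("external tables:", external)
```

In this sketch the graph contains both queries and tables as nodes, so a query points to the table it creates and each table points to the queries that read it. Tables that no file creates are reported as external tables.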
Many programming languages have packages for parsing the DOT language, so it is very easy to utilize the output of the AlphaDAG component. Actually, the diagram I have shown was generated by the Graphviz CLI tool, which can generate graph diagrams in many formats from the DOT language. The complex Airflow graph I showed as an example was also generated automatically by AlphaSQL. Airflow is a popular workflow management tool, and the graph can be shown graphically in the Airflow web GUI, like this. Also, this graph can be executed on Airflow. This means that AlphaSQL enables automatic parallel execution of SQL queries.
So far, I have explained the AlphaDAG component and the applications of its output, such as visualization and parallelization. Next, as the last important component, I would like to explain the AlphaCheck component, which validates SQL and data for the whole dependency graph. As I already noted, the validation of SQL and data is a problem of the whole dependency graph, because errors can occur across the whole graph. So the AlphaCheck component checks the SQL queries in the order of their dependencies and validates the whole dependency graph by virtually creating tables.
In our project, the AlphaCheck component actually validates our SQL queries and data in the process of continuous integration. The result looks like this. In total, 1,532 SQL execution results have been collected so far, of which 72% terminated successfully without errors, 14% terminated abnormally due to errors not related to AlphaSQL, and the remaining 14% terminated abnormally due to errors in the AlphaSQL analysis. Of the errors found by AlphaSQL, the most common were errors due to the SQL not finding the referenced tables or databases, accounting for 12% in total. The second most common were type errors caused by a mismatch in the type of a column or function used, accounting for 3% in total.
Finally, although very rare, there was one error indicating that a cycle had occurred in the dependencies between SQL and tables. This validation result is very helpful because the AlphaCheck component only emits errors that would be emitted by the type checker at runtime, so we can edit queries and change schemas with confidence. The validation guarantees that there are no type or schema errors at runtime.
In this way, all the problems I noted were solved. Complex dependency resolution is solved by automatic dependency resolution. Changing table schemas and dependency cycles are solved by the validation component.
Finally, I will summarize the current limitations and the future of AlphaSQL. The first limitation is that AlphaSQL's main target is BigQuery. AlphaSQL uses the same parser as BigQuery and can read most standard SQL. However, the validation results are not desirable for some SQL systems. In a future version, I'd like to extend AlphaSQL to more flavors of SQL, such as Presto and Hive. Also, the current AlphaSQL is a CLI tool and doesn't have a graphical user interface. Users can edit their queries using their favorite editors, and AlphaSQL can generate an Airflow DAG, but it could be better with an intuitive GUI.
As a summary, using AlphaSQL, we can automatically resolve dependencies between plain SQL and data, visualize dependencies as a DAG, parallelize SQL queries, and validate the whole dependency graph of SQL queries and data. AlphaSQL is open source software and is public on GitHub. You can search for AlphaSQL and then visit the repository. You can learn more about AlphaSQL from the documentation on the top page of the repository. All questions are welcome. Also, I can support you on Twitter. Thank you for your attention.