Hello, it's so nice to be here and finally participate in the CSV conference — I have wished to for a long time. What I'm going to talk about today with you all is a little something we have been preparing, related to processing data flows and data pipelines using open source software: Apache Airflow and a little package we created named FastETL.

First, a little bit about myself. My name is Augusto Herrmann. I work in the Brazilian government, at the Ministry of Management and Innovation in Public Services. For the past decade I have worked on open data: I helped create the open data policy for Brazil and the launch of dados.gov.br, the Brazilian open data portal, from 2010 until 2018. More recently I'm no longer working on that; I work on a data engineering team building data lakes at the Secretariat of Management, and we have been using almost exclusively free and open source software there for the past three years. You can find my presentation at this link.

I'd also like to thank our data team at the Ministry of Management. They spared no effort and worked really hard to make this version easy to use and available as a package that you can easily install.

FastETL is a free and open source plugin for Apache Airflow — in a bit I will talk about Airflow itself. It has features for more easily processing data pipelines, and you can speed up the development process by using some of the functions we provide, some of the operators and hooks. We have little bits for many different things.

Why did we create such a thing? We wanted to use it internally to develop our own data pipelines faster, reusing code that we had used previously on other data pipelines. We have been using this code since 2019; we took the parts that were most common and turned them into this library. We love free and open source software, and we wanted to share knowledge with other data teams and expand the network of developers. So if you like it and begin to experiment with it, it would be great to get some feedback and contributions.

So what is Apache Airflow? Apache Airflow is free and open source software with a vibrant community and frequent updates — that was part of our decision to use it. It's already at version 2.5-something; we started using it back in the first version. It's basically used to schedule tasks and orchestrate them. You define some tasks — usually they take data from somewhere, clean it, and load it somewhere else, but they can actually be anything — and you set the order in which they run, which is called orchestration, and you define when the pipeline should run. It can also trigger other processes. So it gives you visibility into the extract, transform and load process, and also notifications when something breaks or goes wrong.
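As a rough illustration of those concepts (this sketch is not from the talk's slides), a minimal Airflow DAG that schedules two tasks in order might look like the following; the task names and functions are made up for the example:

```python
# Minimal, illustrative Airflow DAG: two Python tasks that run in order
# (extract, then load) once a day. The function names are hypothetical;
# only the Airflow classes and parameters are real.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_data():
    print("take data from somewhere")


def load_data():
    print("load it somewhere else")


with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",  # when it should run
    catchup=False,
):
    extract = PythonOperator(task_id="extract", python_callable=extract_data)
    load = PythonOperator(task_id="load", python_callable=load_data)

    extract >> load  # the order of the tasks -- this is the orchestration
```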
We use that: we receive messages in our Slack team when some pipeline breaks and we need to work on fixing it.

So we use Airflow and FastETL to periodically synchronize several data sources. They can be databases, they can even be spreadsheets — these are not just hypothetical examples, we actually do have data sources in Excel, in Google Spreadsheets, in SharePoint — and then we load them into the data lake. The data is then used for analysis, for auditing processes in government — we even use machine learning to detect outliers, things like that — and also to create dashboards for the people in the different areas of the ministry, helping the decision-making process. We publish data on the open data portal using Airflow as well.

We notify people when specific terms or their names appear in the official gazette — the government's official gazette publishes laws and norms, and some people want notifications when terms on a list appear there, so we send an email for that. We can also send notifications about many other things, and Slack messages as well. We also use it to manage our own sprints on the team: at the end of the sprint everyone receives an email with every task they have done. We actually use it for managing remote work internally, since we need to fill in reports on what we did, and this list of the tasks everyone did is really helpful for that.

So the FastETL plugin for Airflow can create replications of database tables. They can be full or incremental, into SQL Server and Postgres databases; MySQL can also be a source. It can load data from Google Sheets and from file shares in Windows, and of course it can extract CSV from SQL. It can clean data using custom data patching tasks — for each table I have a process that will clean that type of data; sometimes we have messed-up geographical coordinates, and I have some examples of that — and it can map canonical values for columns, like when people fill out some form in free text and we want mapping tables to connect the values. It can also query the official gazette API, as I mentioned. We can use it also to calculate distances on the map from OpenStreetMap, using the Open Source Routing Machine (OSRM), which is also open source software, just for that. We also use the CKAN API to publish open data — or rather we used to, because dados.gov.br is no longer using the CKAN API but its own API, and this new API is also supported. And we've been using Frictionless Data Tabular Data Packages to write data dictionaries in Open Document Text format.

So here's an example of using the DbToDbOperator. Apache Airflow works with concepts called hooks and operators: abstractions where you just configure a connection to some data source and then you can use the hook, or an operator, to manipulate it. I won't dive into detail on what each of those is, because I don't have much time. You just configure the connection information, the schema and table you want to copy, whether or not you want to truncate the destination table, whether you want to copy the table comments (which are the descriptions of each column), and some other options. As I said, you can copy the full table or just incrementally — it's much faster to copy just the difference. The destination can be Postgres or SQL Server, and the source can also be MySQL.
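To make that description concrete, here is a rough sketch of what such a replication task could look like in a DAG file. The import path and parameter names are my assumptions about FastETL's API and may differ between versions, so check the FastETL repository; the connection ids, schemas and table names are placeholders.

```python
# Sketch of a FastETL table replication task. The import path and parameter
# names are assumptions about the package's API; connection ids, schema and
# table names are placeholders.
from fastetl.operators.db_to_db_operator import DbToDbOperator

copy_table = DbToDbOperator(
    task_id="copy_my_table",
    source={
        "conn_id": "source_db",   # Airflow connection to the source (e.g. MySQL)
        "schema": "dbo",
        "table": "my_table",
    },
    destination={
        "conn_id": "data_lake",   # Airflow connection to Postgres or SQL Server
        "schema": "staging",
        "table": "my_table",
    },
    destination_truncate=True,    # empty the destination table before loading
    copy_table_comments=True,     # bring along the column descriptions
)
```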
I want to show you one example use case that goes from the input data and its transformation — from the messy data in the data sources — until the final data is published as open data and as a data visualization. This is a project the Brazilian government has used since 2016: a ride system with taxis for public officials to get to meetings in other places. There have been over 900,000 rides since 2016, there are over 35,000 users (the people who take these taxis), and it's used by more than 160 public organizations at all levels of government.

Here are some numbers about the data sources. There are 29 heterogeneous databases with different table schemas and five different suppliers, which are the companies who manage the taxi systems. Sometimes it's the same schema when it's the same company, but when it's a different company there's usually a different schema. Many of the fields are unstructured text — for example the alleged reason for taking the taxi, or which government organization requested it — and sometimes we need to map that information to the real, existing government organizations.

This is the schematic that Airflow generates of the whole pipeline. Each of those processes is called a DAG — a directed acyclic graph — of processes that extract, transform and load data. It starts daily at 7:45 a.m. It then cleans the data with FastETL's data patching, which is that collection of data cleaning processes for each specific field: there's one for geographical coordinates, there's one for CPF (the number that identifies Brazilian citizens), etc. Then the data that has no privacy restrictions is published as open data, and this whole process finishes before 8:30 a.m. every day.

This is an example of the cleaning part. Above, you have the part that cleans geographical coordinates — we have things like the wrong decimal separator for latitude and longitude, or the wrong sign, or values filled with an arbitrary number like 1,000, etc. — but also the others, like the status, the motive, and other fields that are also cleaned in this process. And this is how the actual code looks when calling FastETL's data patching to clean the geographical coordinates: we just pass a data frame — or rather, we read from a file and it ends up in a data frame — pass the source, the schema name, which are the configured parameter keys, and which columns have geographic coordinates, and it does all the cleaning.

We also read table metadata from the database, like the table descriptions and column descriptions, and we then record it as a file, a Data Package from Frictionless Data, which looks like this — you see, the same descriptions here go there. Then we use the document template to data dictionary operator to take this and generate text documents containing the data dictionary with the same column types and descriptions. You also get the type information, which is very important.

This is the open data publication DAG for this particular dataset. You can see that some tasks can run in parallel — Apache Airflow manages this as well. You configure the number of workers you want to set for processing tasks, and according to the schedule and availability, Airflow schedules and runs each task and returns a status. This green symbol means the task succeeded; if you get an error it gets another color, and then we can click it and see the log file, and usually there's a helpful error message you can use to fix it.

Then, this is an example of using Frictionless Data to read the open data: we just import the Package class and pass a URL, and you can query any of the tables.
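Roughly, that reading step could look like the sketch below, using the frictionless Python package; the data package URL and table name are placeholders, not the real dataset location.

```python
# Sketch of reading a published Tabular Data Package with Frictionless Data.
# The URL and resource (table) name are placeholders.
from frictionless import Package

package = Package("https://example.gov.br/dataset/datapackage.json")

print(package.resource_names)          # list the tables described by the package

rides = package.get_resource("rides")  # pick one of the tables
for row in rides.read_rows()[:10]:     # each row comes back as a typed dict
    print(row)
```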
The example here shows the rides I took myself in 2018 — it's updated daily; you don't see more recent ones because I have stopped using it. And this is an example of visualizing this data; it's actually interactive. I can show it later, because I want to leave some time for questions.

You can find FastETL — just go to the GitHub repository to install it. You must install its dependencies, and then you can install it as a package. These are some of the tools I used for my presentation. If you want to reach out to our team at the Secretariat of Management, here is the email, and my own contact is on my website. Now I'd be glad to answer some questions, either in English or Portuguese.

We have time for one or two questions. There is a question about the CKAN API — why it is no longer used. Well, that's a difficult question. You'd have to ask the current managers of the portal; since 2019 I'm no longer involved with the management of the portal. I wish they still used the CKAN API.

There's another question: what would an organization need, in terms of infrastructure, to set up and use Apache Airflow? Well, you can use it on Kubernetes, or something else depending on your infrastructure. We do not use Kubernetes yet; we use Rancher 1.6, which runs on Docker containers. Basically you just need something like a virtual machine and then set it up. It depends, actually, on the load of your processes — how much data you are going to process. I talked to some people a while ago who processed huge volumes of data, so they started using Apache Airflow, deploying it on AWS, but then their processing grew so large that their bill started to become too costly, so they decided to move to a lighter software stack. But for our needs Apache Airflow has been running very well. We use three virtual machines with 16 GB of RAM, and we have over a hundred ETL processes, so that's been working well for us.

Thank you so much. We are at time. Thank you again.