So, well, my name is Federico Matulie and I'm from a town not far from here, near Modena, so it's very nice to have EuroPython so close to my own town this year. I'm here to talk about a project I did with a startup in London. It was one of the first projects I did that was very data focused, so essentially I'll talk about some of the experience I had with data over the last seven or eight months, some of the tools we used, and some of the choices we made, both in architecting the system and in using the tools. The talk is about Redshift, which is a database, and Airflow, which is a framework I'm going to explain later.

My background is actually web development, and I've done a lot of DevOps in the past. I've used mostly Django. I gave another presentation at EuroPython three years ago, and it was all about Ansible, which is a tool I really enjoy and encourage you to have a look at, because it's quite nice. But then data pipelines arrived and we had to choose a different stack. It was a project where a lot of data processing was happening and there was no web component, so we decided to use Airflow, a project that was born within Airbnb, was then incubated by Apache, and is now an Apache project. We also used some proprietary tools: Amazon Redshift, which is kind of the big one, and S3, which is essentially an infinite hard drive, an object store that is very well known.

We started, obviously, from a product problem, and the product problem was about tracking the attendees of conferences around the world, focusing on technology. We wanted to understand who was going to which conferences, in order to present this data and attract sponsors for conferences. It was an interesting project, but really the technical question we wanted to answer was: can we query and cross-reference multiple external data sources? Most of the data we were working with was not managed by us, it was external, so we needed to bring the data under our control. There was a lot of integration, multiple external data sources, and we just wanted the freedom to query and cross-reference all these data sets. The answer to this question is, in principle, quite simple:
you just have to download all the data feeds, so you have access to the raw data and can do any sort of operation on it, plus you need to load it into a system which allows you to query the data the way you want. This system can be anything: it can be Excel, it can be PostgreSQL. We decided to use Redshift because our use case, the type of queries we wanted to run, was a good fit for it. So essentially those are the two things you have to do, but depending on the complexity of the project there are sometimes multiple downloaders and multiple databases, so you need to make sure you build a good system to manage the data. This is where data pipelines come in, which is a generic term that just indicates the system that supervises the downloading, the loading into the database, and all the transformations you do on the data.

We built these pipelines following three principles. First, we needed a system that was able to scale to any number of inputs, and that was generic enough to adapt easily to any sort of feed, any sort of data coming in through any sort of mechanism. You can think of it as a mixer: you have a lot of inputs and you just mix and match them the way you want. This is the opposite of, for example, a web app, which has a plurality of outputs, a different output for each user; here it's reversed, you have many inputs and one output, which is the database with the structure you want. Second, it's very important for data pipelines to do all the processing in stages, because you might have quite a complex set of transformations, and if you think in terms of stages it becomes very easy to state what input a stage takes and what output it is expected to produce, and to build your pipeline so that each stage does one particular thing and produces one type of output. If you do it that way, you hopefully get reusability of stages, you can split the development of stages between different teams, and so on; there is a small sketch of this idea just below. The third principle is that we should archive everything, especially the inputs of the system, because storage is cheap: you can store a lot of data on S3 and you don't pay much. It was very helpful for us to have a historical record of everything that went into the system. The other purpose of it is that sometimes you have bugs in your code, and with data pipelines you want the ability to rerun your code on data that is a month or two old, because you realize the bug has been in production for two months; once you discover the bug you want to be able to go back and fix the outputs. This is something you can really only do with data pipelines: you can't retroactively fix an HTTP 500 in a web application, but you can do this with data, so it's quite convenient to have that possibility.
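Just to make the idea of stages a bit more concrete, here is a tiny, self-contained sketch; the file names and the content of each stage are invented for illustration and are not the project's actual code:

```python
import json

def download_stage(output_path):
    # Stage 1: fetch the raw feed and archive it untouched.
    # (The feed is faked with a constant here; in reality it would be an HTTP call.)
    raw = '[{"conference": "EuroPython", "attendees": 1100}]'
    with open(output_path, "w") as f:
        f.write(raw)

def transform_stage(input_path, output_path):
    # Stage 2: turn the archived raw feed into the rows we want downstream.
    with open(input_path) as f:
        events = json.load(f)
    with open(output_path, "w") as f:
        for event in events:
            f.write("%s,%d\n" % (event["conference"], event["attendees"]))

if __name__ == "__main__":
    # Each stage only knows about its own input and output file,
    # so stages can be developed, tested and re-run independently.
    download_stage("raw_feed.json")
    transform_stage("raw_feed.json", "events.csv")
```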
Okay, so the tool we used is a tool called Airflow. It's a project that has recently been through the incubator at the Apache Software Foundation. It's a batch processing framework, as opposed to a streaming framework like Storm — I know there was a talk about Storm yesterday — but here we talk about batches. The big difference is that with a streaming architecture every transformation is scheduled at the start and the data then goes through the stages in real time, while with a batch processing framework each step is only scheduled when it's actually needed. So it's a different architecture; in my eyes it's a bit simpler, and it's also probably the older, more established type of architecture. Airflow has a big community, a lot of stars on GitHub, and lots of companies using it.

Really, Airflow allows you to build a network of interconnected tasks that you run, and it can be a very complex network: you can have things running in parallel and things running sequentially. In this example, which is something I found on the internet, you retrieve some files and then you start three parallel scrapers that scrape different things, and at the end of every scraper there's an update step which is specific to that scraper. You can build these networks at any complexity, some quite complex and some a bit simpler. And it's Python all the way down: it's written in Python and you use it with Python.

You might say: why not use cron? Cron has been around for ages. The problem with cron is that it's very simple, and for our use case it was too simple: there's no way to define dependencies between jobs, as in "only start job B if job A has been successful"; there's no retry mechanism; and there's only very, very simple error reporting. It was too simple for us, so we decided to use Airflow.

The way you install it is very simple: you just pip install airflow, since it's written in Python, then initialize a database — Airflow is a very stateful application, it tracks the execution of everything you put into the system, and all this information is written to a database, so it needs one — and then you start the web server, and you get something like this. This is the UI of the system we have in production. Essentially it just lists all the workflows you have; a workflow is a term that is roughly equivalent to a cron job. Every workflow has a name, an owner, and a schedule: daily, hourly, and so on — you can be very flexible in defining when something needs to run. It's actually quite simple, and the UI is quite powerful.
There are a lot of things you can do with the UI, but I wanted to show you how you actually use Airflow. As I mentioned before, Airflow is written in Python, and you essentially create files called DAGs; DAG is an Airflow term which essentially means workflow. You create a Python file which instantiates a DAG class with the workflow name, and you pass a few parameters, the most important one being the schedule interval: do you want it daily, hourly, and so on. Then you start to compose your workflow by creating tasks, which are instances of operators; the operator is the thing that tells Airflow how to run a step, and I'll say more about that in a moment (there's also a small sketch of such a file just below). In this case we run two bash commands, because the workflow is about generating reports for the business team — just an example use case. The first step generates a report and the last step emails the report to the business team. The second task always needs to happen after the first task, and only if the first task succeeded. The way you do it is: you declare a DAG, you declare the two tasks, and at the end you just say t2.set_upstream(t1), which means that t2 is always executed after t1. And with that you have this very simple DAG. You put the DAG file in Airflow's dags folder and it gets picked up automatically; this DAG will be executed hourly, so at the end of every hour you get an execution.

There are different types of operators: you can use bash commands, you can use Python functions — there's an example there. These are the two most common ones we used in the project, but there's also a Docker operator you can use to launch Docker containers, and there's something called sensors, which are a special type of operator that waits for an event to happen — which might be a file being uploaded to S3, or any other sort of thing you need to wait for. Everything is templated with Jinja, so you can have, for example, Python files with variables in them that are substituted on the fly by Airflow, and the same for bash scripts. This is quite helpful, for example, if you want to pass variables from Airflow itself to your scripts; alternatively you can pass them as arguments to Python functions, like in the example here.

I want to go a bit deeper into how Airflow actually works internally. Everything, as I said before, is visible through the UI, so you can see every past execution and its result. For example, if you have a workflow that runs daily, the UI gives you the possibility to see all the statuses and results of every execution of the script, going back from today to the date the script started running. You can zoom in and see that for every task there is a task instance, which is a database construct that stores the start time, the end time, the duration, and the state — whether it was successful or not. It's very informative, especially if data processing is one of the main things the company does.
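Coming back to the DAG file I described above, a minimal sketch of that report workflow might look something like this — the commands, paths and names are made up for illustration, not our actual code:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "data-team",
    "start_date": datetime(2017, 1, 1),
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    dag_id="business_report",
    default_args=default_args,
    schedule_interval="@hourly",   # one run at the end of every hour
)

t1 = BashOperator(
    task_id="generate_report",
    bash_command="python /opt/reports/generate_report.py",
    dag=dag,
)

t2 = BashOperator(
    task_id="email_report",
    bash_command="python /opt/reports/email_report.py",
    dag=dag,
)

# t2 runs only after t1 has completed successfully
t2.set_upstream(t1)
```

Dropped into the dags folder, a file like this is picked up automatically and shows up in the UI under its dag_id.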
So how does Airflow actually run all your tasks? You can choose between mainly two types of execution. One is to scale up using a pre-fork model, like Apache, nginx, and other Unix daemons do: you have a parent process which forks four or five child processes, the parent process polls for new jobs, and when there is a new task, a child process executes it, tracks the result, and writes the result back into the Airflow database. This works very well for one server. If you have the need to scale to multiple servers, you can use the Celery executor, but it's definitely more complex: you need a message queue, you need more things installed. But you can definitely use it.

Regarding all the information the system tracks: you have a lot of information, and one piece of it is the state. You want to know when things are running, you want to know when they have finished running, and you want to know whether they were successful or whether they failed. If they failed, Airflow has a retry mechanism: you can specify how many times you want the DAG to be retried, and you can be more granular than that and specify how many times you want a specific task to be retried. With the retry mechanism there are a couple of states, one is "up for retry" and one is "failed". "Up for retry" essentially means it failed once, but it's going to be retried in, say, five minutes — a delay you can set. Those are, roughly, the state machine's states.

Another concept I want to introduce — and it's something that differentiates Airflow from cron, for example — is how Airflow deals with time, and specifically with downtime. The default behavior is to backfill runs that were supposed to happen but did not. This essentially means that if you have a workflow that is meant to run daily, but your server stays offline for a week, when the server comes back online Airflow realizes that for the last seven days the job wasn't running, so it launches seven copies of your script. The difference between all these executions is a variable passed to the workflow called execution_date: this variable contains the date of seven days ago, six, five, four, three, two, one day ago, and it is meant to be used by your scripts to limit the amount of data they process. This is very good behavior for something like report generation for the business team, because the business team doesn't care whether the server was up or not, it just wants to receive a report every day. But it's actually quite bad for scrapers. We run a lot of scrapers, and we don't want to run seven of them at the same time: first of all it kills our server, it kills their servers, and we just get exactly the same data several times.
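As a small sketch of how execution_date and the retry settings come together in a daily workflow — the DAG and the callable here are illustrative, not our actual scraper:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def scrape_day(execution_date, **kwargs):
    # execution_date is the logical date of this run; bounding the work with it
    # is what makes each backfilled run process only "its own" day.
    start = execution_date
    end = execution_date + timedelta(days=1)
    print("scraping events between %s and %s" % (start, end))


dag = DAG(
    dag_id="daily_scrape",
    start_date=datetime(2017, 1, 1),
    schedule_interval="@daily",
)

scrape = PythonOperator(
    task_id="scrape",
    python_callable=scrape_day,
    provide_context=True,           # pass execution_date (and friends) to the callable
    retries=3,                      # task-level retry count
    retry_delay=timedelta(minutes=5),
    dag=dag,
)
```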
So we started using an operator that was only released in the latest version of Airflow — I think it's called the "latest only" operator — which essentially skips the backfilled runs so that, of all the queued runs, only the latest one actually executes.

Okay, so that was Airflow. Essentially the way we used Airflow was to do extract, load, transform, and more specifically, because of the tools we picked, we followed a fairly generic structure. We start by generating a batch ID, because we want to track every run of our code and we want to see what data a specific execution generated and put into Redshift. So we generate the batch ID, we create an initial folder structure, we run the downloaders and scrapers — whatever we want to run, and this is normally the part that takes the longest — and then whatever comes out of that, we compress it, we upload it to S3, and we load it into Redshift. Those are essentially the generic steps we followed.

I mentioned batch IDs, and for us — and I think in general — it's quite important to generate a batch ID, because you want to be able to map the data you generated back to the specific run that produced it, for debugging purposes, or if you need to fix a bug and then go back and fix the data that the bug broke: you need to know which batch IDs were involved, so you can remove the old batches and regenerate the data. Another reason to have batch IDs is that Airflow may rerun some of the steps, including the load into Redshift, so it's very important to make this operation idempotent: no matter how many times you run it, the result is always one and only one load into the table. We also timestamp all the data, which again is very useful for debugging: you want to know what time it was downloaded, what time it was written to S3, and so on.
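A rough, simplified sketch of that generic flow is below; the paths, bucket and the placeholder "download" step are made up, and in reality each numbered step ran as its own Airflow task rather than one function:

```python
import gzip
import os
import shutil
import uuid
from datetime import datetime


def run_batch():
    # 1. Generate a batch ID so every row we produce can be traced back to
    #    this specific run (and removed/regenerated if a bug is found later).
    batch_id = uuid.uuid4().hex

    # 2. Create the initial folder structure for this run.
    work_dir = os.path.join("batches", batch_id)
    os.makedirs(work_dir)

    # 3. Run the downloaders/scrapers -- usually the slowest part.
    #    (Placeholder: write a single JSON Lines record, timestamped.)
    raw_path = os.path.join(work_dir, "events.jsonl")
    with open(raw_path, "w") as f:
        f.write('{"batch_id": "%s", "downloaded_at": "%s", "conference": "EuroPython"}\n'
                % (batch_id, datetime.utcnow().isoformat()))

    # 4. Compress whatever came out of the downloaders.
    gz_path = raw_path + ".gz"
    with open(raw_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

    # 5. Upload to S3 and 6. load into Redshift, keyed by batch_id so a rerun
    #    replaces rather than duplicates the data (see the COPY sketch later on).
    # upload_to_s3(gz_path, "s3://my-bucket/batches/%s/" % batch_id)   # hypothetical helper
    # copy_into_redshift(batch_id)                                     # hypothetical helper


if __name__ == "__main__":
    run_batch()
```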
So we did all these operations: we ran our downloaders, we got some data on the local disk, we compressed it, we sent it to S3. But because we want to use Redshift, it's quite important to pick a format that Redshift supports. There are two families of formats in Redshift: one is column based and the other is row based. I'm not going to talk much about the column-based ones; first of all they are a bit less common in the Python world — I know there is Parquet and there is ORC, I think — but the support in Python is not great, they are not as well supported as the row-based formats, which are also generally more common. The row-based formats are essentially three: one is CSV; the second is JSON Lines, which is a bunch of JSON objects separated by newlines; and the last and most recent one is a format called Avro, which is an Apache-standardized format.

CSV is very simple, everyone knows what it is. Unfortunately it's order dependent, it's flat, and it's quite hard to extend unless all you do is append data to the end of it, which is quite limiting; and it has no types. So it was a bit too simple for us. JSON Lines, again, is very simple: everyone knows what JSON is. It's easy to extend, because you can just add keys to the JSON objects and things should keep working. It can be verbose if you don't use compression, but you can use compression, so that problem goes away. It's schemaless, so it doesn't enforce a key being there or a value having a specific type, which might or might not be a problem for you. The last format is Avro: it's not as common, it's easy to extend, and I think it carries the schema with it, so you can enforce keys being present and you can enforce types. Unfortunately it has a somewhat smaller ecosystem, and in our project we decided to use JSON Lines: it's very simple and it was good enough for us.

So we got to the point where we save all this data into Redshift, so let's talk a bit about Redshift. Redshift is a product created by Amazon, and it is not open source. Essentially what they did is they forked PostgreSQL and made it work for OLAP workloads, analytical queries. They changed quite a lot: they changed the query planner and they changed the storage engine. It's a bit like buying a car and changing the engine — it was quite a big change — and they made it work well for huge volumes of data, with very good on-the-fly compression support. But it's a columnar database, which means that the kinds of things you would expect a row-based database to be fast at are actually quite slow on Redshift, given the right volumes of data, and the opposite is also true: things that are slow on PostgreSQL are quite fast on Redshift. For example, aggregating over a single column with many values is quite fast on Redshift but quite slow on PostgreSQL; on the other hand, if you do lots of inserts and updates on single rows, that is definitely slower than PostgreSQL, and if you select a single row and want all of its columns, that's quite slow on Redshift.

The biggest difference is really the storage model: Redshift uses a columnar layout on disk and PostgreSQL uses a row-based layout. What that really means is that when you have a table, PostgreSQL writes it to disk by traversing it row by row, in a horizontal way, while with Redshift the traversal is reversed: it traverses all the values of one column, and when it reaches the end it does the second column, then the third column, and so on. This is pretty fundamental, and that's why the performance profile is reversed, and why queries that are slow on PostgreSQL are fast on Redshift. The other big difference is that the block size is normally quite big, so you can load a lot more data into memory at once, while with PostgreSQL you have many different cycles for many different loading steps. Considering that the most critical part of a database is loading data from disk into memory, this is quite significant.

Okay, so that was Redshift. How do we actually use it from our system? Redshift has very good support for S3.
It's just a matter of like executing a square query Is the copy is a copy command you specify destination table you specify Where data should be read from I specify whether it's a JSON File or not. I mean in our case, it's a JSON. It's a JSON lines file and GCP if it's compressed and What this query does is like four things one is reading data from S3 Once it's read like if the compresses data on the fly and then it does something called like JSON path flattening Which I'm going to explain in in a second And then whatever comes out of this third operation is then appended to to an existing table So it's a loading loading this data in the table So Jason path flattening is essentially An operation that takes an S destructor like Jason You know with Jason you have like key values, but then a value can be another object So you have like another set of key values or can be an array and Essentially like You have a lot more nesting But then if you think about rush it to rush it is still like a relational database You have tables and tables are like a flat flat structure So like the way you I Mean the way we decided to do it is using Jason and Jason path which essentially When you apply this This file to a Jason lines object, it will generate Like a sort of CSV with five fields The first one being the value of of them of the key ID second one being the first element of the ray of the value of data one and then you have Subfield of data to and so on So at the end of this you have a faster today can just append to a table and that's that's the way we did it And I think what mentioning is that Like you know for for for any software project the schema of the database is quite important We wanted to have the schema kind of versioned and storing it like you probably would with any other Infrastructure code if using Ansible you want Ansible to be to be in your repository if you start form the same thing and The scheme is just another is not the one of those assets that you really should put in get an inversion So yeah, we definitely use a school alchemy And We don't rely on a squalemate too much to do or am operations because it's not the use case of fresh shift and I mean you can still Create queries using a school alchemy like we don't really use models in that way and The other benefit of having a school alchemy is that you can it's just nicely integrates with migration framework called a limbic which is quite complex and and quite kind of featureful But essentially like with a migration framework. It's really a true potential date that you tend to use One is a limbic revision which generates a migration files and And once you have these migration files you can Run this migration file on the current schema to upgrade the schema or downgrade the schema depending what kind of direction you want to go and To apply this fires is just a matter of running like a limbic upgrade And as I said, it's quite it's quite a complex framework and it's a possible multiple environments And how to generationally work sometime because reshift is definitely as a lot of as cases is a bit weird sometimes especially for example when you alter columns like You cannot really alter columns in a shift you have to drop columns and read them Which has a lot of consequences Which gets me into these annoyances and The ratio that's a lot of things which are quite annoying. 
One thing worth mentioning is that, as for any software project, the schema of the database is quite important. We wanted to have the schema versioned and stored the way you probably would with any other infrastructure code: if you use Ansible you want your Ansible code in your repository, same with Terraform, and the schema is just another one of those assets that you really should put in Git and version. So we definitely used SQLAlchemy. We don't rely on SQLAlchemy very much for ORM operations, because that's not the use case of Redshift, but you can still build queries with SQLAlchemy — we just don't really use models in that way. The other benefit of SQLAlchemy is that it integrates nicely with a migration framework called Alembic, which is quite complex and quite featureful. With a migration framework there are really two operations you tend to use: one is "alembic revision", which generates migration files, and once you have these migration files you can run them against the current schema to upgrade or downgrade it, depending on which direction you want to go; applying the files is just a matter of running "alembic upgrade". As I said, it's quite a complex framework and it supports multiple environments. Autogeneration doesn't always work, because Redshift, like a lot of databases, is a bit weird sometimes — especially, for example, when you alter columns: you cannot really alter columns in Redshift, you have to drop columns and re-add them, which has a lot of consequences.

Which gets me to the annoyances: Redshift has a lot of things which are quite annoying. For example, the varchar length is expressed in bytes instead of characters, while for PostgreSQL it's expressed in characters. So sometimes you have edge cases where you scrape sites with Japanese characters and suddenly the load breaks, because Japanese characters mostly live in the part of Unicode that needs more than one byte per character, so you end up with two bytes or more instead of one, and that was really annoying for us. You cannot alter column types, as I said before. Referential integrity is not really a thing in Redshift, so primary keys and foreign keys are not really enforced — but that's okay, for our use case we don't need it. Columns are nullable by default, which I think is different from PostgreSQL.

Is this system future proof? I think we're reasonably happy with it. Redshift Spectrum came out, which seems very interesting because it would essentially mean we could skip all the loading into Redshift, since Spectrum reads data directly from S3. There are also a couple of open source projects we might look at: one is called Presto DB, which again is able to run queries directly on S3 without having to ingest the files into a database, which is quite nice — and we already put everything on S3, so it might be a good thing to look at. And there's another project I want to mention, which is actually an extension to PostgreSQL called Citus, made by a company in the US; if you are a fan of PostgreSQL and you want to keep using the original PostgreSQL with all the extensions that you've learned to like, you can have a look at that company — Citus Data, it's called.

Yeah, thanks a lot, I think that was it. I also want to mention that I'm actually looking for work, especially on this sort of stuff, because I really enjoyed it, so if you have a project like that, come and talk to me, I'm happy to talk about it. Thank you very much for listening.

[Host] So, any questions for Federico? And yeah, thanks for your talk. [Federico] All right, thank you.