So, thank you for the nice introduction. Yes, I'm a data scientist and software engineer at Blue Yonder. In my day-to-day work I work a lot with Python, I work with SQL, and I'm responsible for the databases in our company. You can find the slides of this talk and some other stuff, including software that we have open-sourced, in the Blue Yonder organization on GitHub. When I'm not working I still like to fiddle around with Python and the Python community, because I really like the community. So I became one of the co-organizers of PyCon DE 2017 and 2018, and also a co-founder and co-organizer of the PyData Karlsruhe group we just started. We are a small group at the moment, but we really plan to become bigger.

Before I go into my talk, let me pitch the conference a little bit. For the second time in a row we'll have the PyData in Karlsruhe, in southern Germany, near the Black Forest. It's always a combined event: a PyCon DE with a PyData track. Last year we had about 450 people there, we had three tracks with talks and one track with tutorials, and we had a really nice venue at the ZKM, the Center for Art and Media in Karlsruhe. Here are some impressions. We had great keynotes: we talked about the universe and what it is made of, we had the great art exhibition Open Codes running in parallel, and we had another great keynote speaker, which I was very proud of, the founder of Dask, Matthew Rocklin.

So if you think: last year we had Matthew Rocklin, and my talk is about Dask and pandas, who would be a nice keynote speaker for this year? That's brand new: we have got Wes McKinney as a keynote speaker. He's the creator of pandas, he is a PMC member of Apache Arrow and Apache Parquet, which is what I'm talking about, and he's also with the Apache Software Foundation. I'm very proud that we have got him as a keynote speaker, because his work touches lots of areas that we work on in the company and that I'm personally interested in. We have already sold 150 tickets, so if you are interested, you really should hurry up, because once we get the schedule out, the tickets will go in the blink of an eye, and we expect to be sold out once we push the talks out.

So let's come back to my talk. Today I will talk about columnar data in Apache Parquet and how we use pandas and Dask to work with it. API-wise, that's not that much in pandas and Dask, but I will tell you a little bit more about the challenges and motivations in our software architecture that made us change it. We came to the conclusion that Apache Parquet is a nice storage technology for us, and I'll show how we use pandas and Dask to work with it and how it improved the architecture in our company.

So what do we do at Blue Yonder? We are a machine learning company, and we work in the retail space. Our business model is that the customers, big retail chains, send us their data: mostly sales, stocks, product information, and store information. We put all this data into a big SQL database, we train our machine learning algorithms on the historical data, and then we can make predictions about future sales in the different stores. We then use this probability density function to build replenishment on top of it: we tell the supply chain every day,
"You need to order this amount of these products for this store for the next day," so that they do not run out of stock and do not have too much waste. That's our business model, and we work with quite some data.

We mostly use supervised regression for our models. We get lots of data from the customer, but we also have additional data sources like events or weather data. Based on that, we basically build a huge matrix with features for each product, location, and date combination, we use the historical values to train our models, and then we run our machine learning pipeline on a daily basis to generate new sales forecasts each day. I'm not going any deeper into the machine learning today. The only thing that is important: normally we have historical data from our clients for about three to five years, and that's up to a hundred billion records. So that's quite some stuff.

Let's look at how the data flow at our company was one or two years back. We have these huge clusters of an in-memory, columnar, distributed SQL database; it's proprietary software called Exasol. We normally run it as four- to eight-node database clusters, and it keeps all the data in memory. The customer sends us data every day, and we put it into this database. Then we have a machine learning cluster based on Apache Mesos and Aurora, and we run open-source software like Jupyter, Dask, or pandas on top of it. We pull the data out of the Exasol database into the machine learning cluster, and afterwards we insert the results back into the database, and the customer pulls them from there.

To give you some numbers: if we calculate predictions for a typical retailer, that's about 20,000 products and about 500 stores, with a horizon of a couple of days, so we calculate about 20 million decisions each day, for each customer. And all this data has to come out of the database and go back into the database again.

That's where we have a bottleneck. On the one side, we can trivially parallelize our machine learning algorithms by clustering on product groups or locations, so we can scale this out to 500 workers with no problem. But in front of that we have the database, from which we can pull data with at most 10 or 15 connections before it is under full load. And it's even worse when we want to insert the data back into the database: because of full table locks, we can only insert with one connection at a time. We insert at a pace of about 25,000 rows a second; if you calculate it up, we already spend about two hours just getting the data back into the database. If you work in the retail space and look at the operations clock of a typical retailer, a two-hour delay, only because we need to get all the data into the database, is a huge amount of time for them. So we figured out: okay, we need to do better.

Our conclusion was: Python is a really good companion for the data scientists. They are really happy with it, they like to build their data models with it, they really like it. But it has not been the best at moving large data in and out. So why do I care, and what do we think about?
Most of our data are huge time series. They do not necessarily need to be in a database. We would really like to have these time series in a distributed file system, with a non-uniform schema, and we'd like to execute queries on top of them; but most of the time we are only interested in a subset of the data, not all of it. There are lots of technologies out there that work very nicely under these constraints, but they are mostly in the Java ecosystem, like Presto, Impala, or Drill, so they are not available in Python. It's great that they exist in Java, but we needed something available in Python to really get our data in and out.

So the obvious solution for us, two years ago, was: let's look into Parquet as the file format, and into a blob store as the storage technology, to get the data into our machine learning pipelines in a better way. Two years ago there was no really good Parquet library for Python; with some work from us and others this has changed, and we are really happy with it.

There is a general problem in all languages: if you want to access data formats, you often have very good interoperability within your own ecosystem, which in the Python world is mostly backed by NumPy, but poor integration with other systems. Within one ecosystem an in-memory copy is very fast; you have a little bit of serialization and deserialization on top of it, but that wouldn't be that big of an issue. If we go back two years in time, most of the time the solution for cross-language data sharing was: we need to go with CSV. But CSV is not a real file format, it's just text, and, yeah, I don't like it.

So the obvious solution is Parquet. Parquet is a binary, columnar data format; let me give you a little explanation of what this means. The typical records that we operate on always look like: on this day, in this location, we have sold this amount of this product. Of course, in our real machine learning pipelines we have many more features, but it always looks a little bit like this. If you look at traditional database systems, or at how a traditional memory buffer is stored, it's in a row-based fashion: you write row after row into your memory. Columnar storage takes a different approach: it writes all the cells of one column one after another, then the next column, and then the next column. This might not be the optimal approach if you are only interested in working with one row, because then the traditional memory buffer is better: you can select one row and just read the records that belong to this row. But columnar storage is much more efficient if you want to work on huge blocks of your data, because you can fetch all the data of a column at once, it fits into the caches of your CPU, and you can work with these blocks much better.

So, Parquet is columnar storage. It was started in 2012 by Cloudera and Twitter, so it was born in the Hadoop ecosystem, and the first implementations were in Java. The first 1.0 release was in 2013. It's now an Apache top-level project, and starting in fall 2016 there was the first Python and C++ support. At the moment, in the Hadoop and big data Java ecosystem, it's the state-of-the-art default IO option.
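To make the row-wise versus columnar idea from a minute ago concrete, here is a tiny sketch in plain Python; the field names are just stand-ins for the sales records I described, not our real schema:

```python
# Three sales records, as you would see them in a table.
records = [
    {"date": "2017-01-01", "location": 0, "product": 42, "n_sold": 3},
    {"date": "2017-01-01", "location": 1, "product": 42, "n_sold": 7},
    {"date": "2017-01-02", "location": 0, "product": 42, "n_sold": 5},
]

# Row-based layout: one complete record after another,
# like the pages of a classic database.
row_wise = [list(r.values()) for r in records]

# Columnar layout: all values of one column stored contiguously,
# which is what Parquet does on disk and Arrow does in memory.
column_wise = {key: [r[key] for r in records] for key in records[0]}

print(row_wise)     # [['2017-01-01', 0, 42, 3], ...]
print(column_wise)  # {'date': [...], 'location': [...], ...}
```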
So why should you use Parquet? As I said earlier, it's a columnar format, so it's very good for vectorized operations. It has very efficient encodings and compression. You can use predicate pushdown to only read the data that you need; I'll tell you a little bit more later about how this works. And it's a language-independent format, so you have libraries in Java, Scala, C++, and Python. A big advantage over, for example, CSV is that it's a splittable format, so you can read only the parts of your file that you are interested in. In CSV you can't just jump to row 100,000 and read a block from there; you always have to seek through all the data to see where you are. Parquet is much better there. You also have compression at the column level, which is very efficient together with the columnar layout of your data, and you have rich metadata: a schema attached to your file, and statistics, which you can use to read your data efficiently.

At the moment there are two implementations that you can use in Python to work with Parquet. One is the fastparquet library, which comes more from the Dask people, who have implemented it, and the other one is from the Apache Arrow project, which is mostly driven by Wes McKinney. What is Apache Arrow? It's a specification for an in-memory columnar data format, so it maps very well to the Parquet on-disk file format. It's a language-independent format with an implementation in C++, and you can expose the API to other programming languages; that's how the Parquet support was brought to pandas, via parquet-cpp, without much additional reading code. And you can use the Apache Arrow project with other technologies like Spark or Drill as well.

Let's have a look at the Parquet file structure. As I said, it's an on-disk format. Each file splits up into row groups; a row group is about five megabytes to one gigabyte of data, and, I would say, from ten thousand to a hundred thousand rows. Within a row group you have column chunks: it's a columnar file format, so for each column you write the data of each record one after another. Then you have pages within a column chunk, which are used for compression and efficient storage. And at the end of the file you have the statistics, as part of the metadata of the file. This is a very nice feature, because this way you can stream data into a Parquet file: you just keep track of the metadata and the statistics, and once you are done, you write the statistics. You don't have to write the statistics first, or, even worse, keep all the data in memory before you write it; you can stream it out.

Now, if you want to work with Python and a Parquet file, how does this look? It's very easy: there is the pandas top-level function read_parquet, and you can pass it either a file handle or a file name. The first thing, if you are interested in reading Parquet files efficiently, is that you should specify only the columns that you are interested in. If you have a Parquet file with, like I said earlier, five columns, and you are only interested in the date and the sales, and you pass those in, you will only read the data of those columns from disk, via the column projection. This was actually my first contribution to pandas; I'm pretty proud of it, but it was basically just passing the parameter through to the underlying libraries. As you can see, if you pass the columns, you will read from disk only the columns that you requested.
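As a minimal sketch of that call; the file name and column names here are placeholders, not the ones from our pipelines:

```python
import pandas as pd

# Only the two requested columns are read from disk; all other
# columns in the file are skipped thanks to the column projection.
df = pd.read_parquet(
    "sales.parquet",
    columns=["date", "n_sold"],
    engine="pyarrow",  # or "fastparquet"
)
```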
What's next? What you should care about, if you want to work efficiently with Parquet, is predicate pushdown. You not only want to read just the columns that you are interested in, but you also want to skip the data that is not relevant for you. This saves IO load, as the data doesn't need to be transferred, and it also saves CPU cycles, because the data doesn't need to be decoded. As an example: if you want to know which products are sold in dollars, you can select just the product and the currency columns and then filter out the euros, and if you have the statistics in the file, you only read the cells that you are interested in.

Predicate pushdown is at the moment only available in the fastparquet library with a Python interface. It's planned for PyArrow too; it's already there in the underlying parquet-cpp implementation, but it's not yet exposed to the API in Python or to pandas. What you get is a very easy filter syntax: you specify the column, an operation (equals, bigger, smaller, in, not in), and a value, and this is then used to skip the reading of certain parts of your file. You can switch the engine between fastparquet and PyArrow in pandas when reading the files. So if you want to read the data and you are only interested in certain locations, you can use this to skip the reading of whole row groups and just read the row groups you are interested in.

How is this done? When the data is written to a Parquet file, statistics and metadata about the row groups are written as well; for example, for every column the min and the max values. If you read this metadata first, you can use it to skip a whole row group if it doesn't match your predicate. This only works well on sorted data, because otherwise, with more or less randomly distributed columns, each row group will have min and max values that span the whole distribution of your data. So if you want to use this feature, you should make sure that you write the data to disk in sorted order; then this will really speed up your reads, but otherwise you won't gain that much.
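Here is a minimal sketch, assuming the fastparquet engine described above; the file name, column, and value are placeholders, and pandas hands the keyword through to the underlying library:

```python
import pandas as pd

# Each predicate is a (column, operator, value) tuple. Row groups whose
# min/max statistics cannot satisfy the predicate are skipped entirely,
# which saves both IO and decoding time.
df = pd.read_parquet(
    "sales.parquet",
    engine="fastparquet",
    filters=[("location", "==", 0)],
)
```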
Another trick that you can use, on the one hand to minimize the data storage size on disk, and on the other hand to speed up reading later, is dictionary encoding. This is a very good fit for pandas categoricals. If, for example, the color of your product always comes from the same categories, like blue, red, blue, yellow, you can use dictionary encoding, and Parquet will then store the data dictionary-encoded on disk. You store each distinct value only once, and then just store pointers to the values, which is much more efficient; and on top of this you can also use run-length encoding on the pointers, so this gets even smaller.

And then, if you want to filter again, based on the color, or, in my example here, on a sales type, you can just check in the dictionary of this row group whether your predicate matches the row group; only then do you read the data, and otherwise you don't read it. So basically you can work with the dictionary encoding like a Bloom filter, but without the false positives. The performance penalty for reading the dictionaries is not that high; it's just about one percent of the reading time. To use it, all the values in the column need to be dictionary-encoded, so all row groups and all pages really must use the dictionary encoding. This is especially helpful if the column is not sorted, because you can still benefit from partial reading of your data: again, you look at the dictionary of each row group first, and then you can decide to skip a whole row group and not read it. This can massively speed up your applications.
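A minimal sketch of how this looks from the pandas side; when a column has the categorical dtype, the Parquet writers store it dictionary-encoded (the names and values are just illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "product": [1, 2, 3, 4],
    "color": ["blue", "red", "blue", "yellow"],
})

# A categorical column stores each distinct value once plus small
# integer codes, which maps directly to Parquet's dictionary encoding.
df["color"] = df["color"].astype("category")

df.to_parquet("products.parquet", engine="pyarrow")
```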
Besides compression and encoding at the column level, Parquet has compression for the whole Parquet file, so you can shrink the data independent of its content. It's more CPU-intensive than the encodings, but encoding and compression together work better than compression alone. Available are gzip, Snappy, and Brotli; Brotli is the newest one of the compression engines. If you don't want to dig deep into which compression engine is best for your data, just stick with Snappy; that's what we have seen work well.

Here are some storage comparisons that we have done. I think it's based on the New York taxi data records; the columns are, I think, a boolean, a datetime, a float, and an int, and we have tried to store them in different formats. You see that with Parquet you really get a compression factor of about 10, and you still read it much faster than, for example, CSV.

Another trick that we use very often, and at this point we switch to Dask, is data partitioning. Until now we have always read from one file, and just read parts of that file. This is fine if you work on a local machine, or if you work with the data from only one consumer. But as soon as you have a distributed system and work with multiple consumers, and especially with multiple producers, it makes sense to split the data into multiple files. It's not really a standard, but it's based on the Hive partitioning scheme, where you basically encode the partitioning into your file names, or into the paths. So, for example, we have a directory date=2017-01-01, and within it another directory location=0, and in there you have multiple parts of your Parquet files. With Dask you can create this kind of structure, but Dask can also be used to auto-detect the structure and read the data back in: you just hand Dask the directory, and it recognizes the structure of the data directory. And you can also use predicate pushdown this way, to only read the files where the predicate matches your file paths. A small sketch of this follows below.
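Here is a minimal sketch of that with Dask; the paths, column names, and values are placeholders, and the exact file layout depends on the engine underneath:

```python
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({
    "date": ["2017-01-01", "2017-01-01", "2017-01-02"],
    "location": [0, 1, 0],
    "n_sold": [3, 7, 5],
})
ddf = dd.from_pandas(df, npartitions=1)

# Writes a Hive-style directory tree, for example:
#   sales/date=2017-01-01/location=0/part.0.parquet
ddf.to_parquet("sales/", partition_on=["date", "location"])

# On read, Dask detects the partitioning from the paths; with filters,
# parts that cannot match the predicate are never even opened.
ddf_subset = dd.read_parquet("sales/", filters=[("location", "==", 0)])
```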
If you're working in a distributed system, sooner or later you have to move away from your local computer and split computation from storage. The approach that we took: we didn't want to host HDFS on our own, and we are running on Azure, so we decided to use the Azure Blob Storage to store all our data, all the Parquet files. You can pass to Parquet, or to PyArrow, either a file handle, a local file, or something that behaves like a Python file object. So, as an example, you can get a stream or file object of a blob from the BlockBlobService, which is the way to access Azure Blob Storage, pass it into the Parquet read_table function, and then go to pandas to get the data into pandas. This works very well if you want to work with the whole data; but if you take this streaming approach, you lose all the benefits that I explained before, that is, predicate pushdown and reading only the data you're interested in. So we have also written an interface to the Azure Blob Storage that implements the Python file interface, an IO interface where you can seek and where you can read. We have open-sourced it as part of the simplekv library; simplekv is an abstraction layer over different storage technologies. This helps us to get the benefit of predicate pushdown while storing the data on a remote file system.

Another real-life improvement that we have seen comes with Apache Arrow. As I told you earlier, our architecture is such that we need to get the data out of the database and then analyze it in pandas, and the data in the database is stored in a columnar form. Until two years ago, if we wanted to pull the data out, we used pyodbc as the layer to talk to the database. But pyodbc works in a row-wise fashion, so all the data was transformed from the columnar data storage to row-wise, and then, when we passed it into pandas, it was converted again from row-wise to column-wise. What we did instead, and probably what many people do if they want to work with data in pandas, was the ugly real-life solution: export the data from the database as CSV, bypass ODBC entirely, and then read the CSV back in. But with CSV you lose all the type information; it's just a hack, it's not the real way to do it.

So what was our solution? One of my colleagues, Michael König, implemented turbodbc. That's an ODBC implementation for Python that can also work with columnar data and supports the Apache Arrow standard. So you can get the data from the database in a columnar fashion and expose batches of PyArrow tables to the Python world, and you can either write them away as Parquet again, with nearly zero memory transformations, or get them into pandas and then work with that. If you are more interested in this technology, or in turbodbc in general, you can watch the talk Michael König gave last year, or check out our tech blog, where we have a really deep dive into how he implemented it.
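As a minimal sketch of that columnar path with turbodbc; the DSN and the query are placeholders for your own setup:

```python
import turbodbc

# Connect through ODBC; the connection options depend on your driver setup.
connection = turbodbc.connect(dsn="my_database")
cursor = connection.cursor()
cursor.execute("SELECT date, location, n_sold FROM sales")

# fetchallarrow() returns the result set as a pyarrow.Table, so the data
# stays columnar from the database all the way into Python; from here it
# is one step to pandas, or to a Parquet file on disk.
table = cursor.fetchallarrow()
df = table.to_pandas()
```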
So where are we now? We moved away from the direct access to the database, because that was the bottleneck. We now export the data from the database into the Azure Blob Storage, and from there we run our machine learning models. The huge benefit is that we really can scale out our machine learning models, because the Azure Blob Storage scales much, much better than our database. The same goes for the inserts: on the Azure Blob Storage we don't have to synchronize the inserts; every job can just write into its own blob, and then we can work on the data to get it back to the customer. Another nice benefit is that we now use a file format that is shared by different implementations, so we can also use Hive, Presto, or Drill to do analyses of the data, and we don't have to store these large time series in our in-memory database, which is much more expensive than the Azure Blob Storage. For terabytes, the blob storage costs, I don't know, 10 or 15 euros, so that's nothing. And the query engines on the bottom right are ones that you can fire up for an analysis and shut down again once you are done; you don't have to pay the whole day to keep them running, like with the database.

So what we have learned is: save your data in one ecosystem, load it in another. This really helps us; but always persist the intermediate data. If you're interested in this topic too, there's still lots of stuff to do in the Apache Parquet project, but also in the Arrow project there are still lots of functions that are not yet exposed to Python, pandas, or Dask. Some of my colleagues are Apache members and work on these projects, but there's still lots of things to do.

That's it; that's been my talk on using pandas and Dask to work with large columnar datasets. If you have any questions, feel free to ask me now or later, or just ping me at the conference.

If you have questions, can you queue at the side of the room, please? I'm really happy; I'll sit on the left side of the room.

Thanks. I remember two years ago you were also at EuroPython, but you were talking about Spark back then. Now I see that the picture of your stack doesn't have PySpark in it anymore; it's replaced with Dask. So, just a question in that sense: is Dask ready for primetime?

I would not say "not yet, so just go back to PySpark". Also, with Arrow, PySpark has really improved; all the Python access inside Spark has really improved. Two years ago, if you went from the JVM world to the Python world, you had very expensive encoding and decoding of every value, and it was really so slow that you basically couldn't use it at all. I think with Spark 2.3 they have introduced Arrow for the conversion, so that you can now work in PySpark in Python, more or less get pandas results, and the whole serialization is done in Arrow. That's much faster now, but two years ago it wasn't available, so we went in the Dask direction. We are still not using Dask everywhere in production, so we still have our own scheduler and some of our own libraries, but we plan to do it, and we do it because Dask is much more lightweight and fits much better into our stack: we are a Python company, and this way everything can still be Python. So that's the reason for us; the footprint of Dask is much, much smaller than that of Spark.

Thanks. If you have a question, please come here; we're not passing the mic around, sorry.

Great talk, thank you. I wanted to ask about turbodbc: have you tried to use the SQLAlchemy support in pandas to actually get the data like that?

Turbodbc also works nicely with SQLAlchemy. You can use SQLAlchemy from pandas, but you always need the driver below it to talk to the database, and we either use turbodbc or pyodbc; that's the part that is fast or slow. If you want to use SQLAlchemy in pandas, that's just a layer above the other stuff.

Thank you. I have some questions from my practical work. In our data pipeline flow, there's one step where we need to read in a file which is in nested JSON format. The first time, we tried to use the pandas read_json function, but that function cannot handle the nesting problem. So the solution was quite evil: basically I had to write a very static library myself to handle these strings and transform them into a pandas DataFrame. So I was thinking whether in the future you would like to think about adding a feature for reading nested JSON. Maybe the solution could be to first analyze the structure of the JSON to identify what is going on, and then use some string-cutting mechanism; I mean, that's my temporary solution for it.

Yeah, so basically Parquet supports nested data structures. In Parquet files you really can store nested data structures, like your nested JSON; not as JSON, but as nested data structures, Parquet can handle this. But I don't think this will be exposed to Python, to pandas, because pandas is a tabular, two-dimensional interface, so it doesn't fit well with nested data.

The reason I ask is that the nested JSON will finally turn out to be two-dimensional; that's why I think it would be good to consume it and transform it directly into a DataFrame. That's why I wrote a library, an internal library, to directly transform this nested JSON file into a DataFrame and pass it to pandas.

Yeah, okay; I think that's the way to go, like you did it.
Do we have another question? No? Then let's go downstairs. Okay, so I want to thank you again, Peter, for the nice presentation.