So, I will be talking about scaling up data pipelines using Apache Parquet and Dask and the learnings we had while using them in our workflows. A short intro about me: I work as a senior data scientist at Ecolab, where my work is related to building scalable data products, mostly in Python, SQL and data science. Ecolab is a global leader in water, hygiene and energy technologies and services. All over the world, companies use our solutions to work towards a more sustainable planet, and we find solutions to the world's biggest challenges: clean water, safe food supplies, abundant energy and healthy environments.

So, typical machine learning systems have a data flow in which data is extracted out of source systems, you do some transformations, machine learning models are fit, and then the results are loaded back into a database from where they are used for various purposes like reporting or visualization. Now, we can parallelize our machine learning systems based on clusters in the data, for example clusters on product group, customers or location, and scale this out. However, the bottleneck here is the extraction and loading of data from the database. The second problem is that we can have multiple concurrent connections to extract data, but we have to load that data back with a limited number of connections.

As we have heard, Python is used for many purposes: web development, scientific computing, data science, and in a lot of these use cases Python is often used to move data in and out of databases. However, it's not the best tool when the data is very large and doesn't fit into memory. So Dask, with Apache Parquet as the file format, can be used for improved performance.

Apache Parquet as a file format comes from the Hadoop ecosystem. It was invented in 2012 by Cloudera and Twitter. At Twitter they had huge log files and they wanted to store them more efficiently so that they could evaluate large queries on them. In July 2013 it had its first release, and it became a top-level Apache project because it got used in various Hadoop ecosystem projects. Eventually it had general availability, and in fall 2016 it gained support for reading and writing from Python and C++. It is the state-of-the-art format in the Hadoop ecosystem for storing data, and if you look at a lot of tutorials, like Spark DataFrame tutorials, they usually use Parquet as the file format.

So like I said, Parquet is a columnar data format, and I'll explain a bit what that means. If you have two-dimensional data, then when you write it out you need to pick whether you are going to write the data by row, which is the traditional format, or by column. That's the main difference. With rows it's straightforward, you write one row after another; with columns it's the other way around. The benefit of columnar data is that once the data is laid out this way, you can encode a bunch of values of the same type together, so you can use much more efficient and smarter encodings, and compression also works much better when the data is homogeneous. For example, if you compress text, integer and decimal values mixed together, it's less efficient.

Some advantages of Parquet: because it's a columnar format, it's very good for vectorized operations, and it has very efficient encoding and compression, which results in smaller file sizes. We just saw the reason why, and I'll share some benchmarks afterwards.
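To make the file-size point concrete, here is a minimal sketch, not from the talk, that writes the same synthetic table as CSV and as snappy-compressed Parquet and compares the sizes on disk. The column names and the data are made up, and pyarrow is assumed to be installed.

```python
import os

import numpy as np
import pandas as pd

# Hypothetical example data; the size gap grows with real, larger datasets.
n = 1_000_000
df = pd.DataFrame({
    "location": np.random.choice(["berlin", "pune", "london"], size=n),
    "product_group": np.random.choice(["A", "B", "C"], size=n),
    "sale": np.random.rand(n) * 100,
})

# Same table, two on-disk formats.
df.to_csv("sales.csv", index=False)
df.to_parquet("sales.parquet", engine="pyarrow", compression="snappy")

print("csv bytes:    ", os.path.getsize("sales.csv"))
print("parquet bytes:", os.path.getsize("sales.parquet"))
```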
It also supports predicate pushdown. What does that mean? With predicate pushdown you read only the data that you need; for example, you can jump directly to the 10,000th line of the data and read just that block. It's also a language-independent format, so if you have frameworks in different languages they can still share the data. There are two typical habitats of Parquet: one is query engines like Hive, Presto and Drill, and the other is frameworks like Apache Spark or Hadoop MapReduce, which use it to get maximum performance out of the framework. There has also been growing support for Parquet in Pandas and Dask, with several benchmarks showing that you can get really good speed-ups for your application.

Most of us know Pandas. Dask, as Jake explained in yesterday's keynote talk where he gave a pretty good overview of the library, is a flexible parallel computing library. With Dask you take a Python workflow which you currently have and you can easily scale it up.

So let's have a look at the Parquet file structure. As I said, it's an on-disk file format, so the files are stored on your disk, in S3 or in a Hadoop file system. Each file is split up into row groups; a row group is about 5 MB to 1 GB of data, which could be from 10,000 to 100,000 rows. Within a row group you have column chunks; a row group has all the same columns, and all the columns in a row group have the same length. Then you have pages, which sit within the column chunks and are used for compression and efficient storage, and at the end of the file you have statistics, which are the metadata of the file. So hierarchically, a file consists of one or more row groups, a row group consists of exactly one column chunk per column, and a column chunk consists of one or more pages.

Now, the idea behind Apache Parquet was the common need for an on-disk columnar storage format, and there was a parallel project going on, Arrow, which addressed the common need for an in-memory columnar format so that there can be zero-copy reads. Wes McKinney, who is a developer of Pandas, wrote a great blog post about it in 2017; I have included a link to it, "Apache Arrow and the 10 things I hate about pandas". He, along with other developers, started this project for an in-memory data format to facilitate data exchange between different frameworks like Python, C++ and so on. This basically gives a clear picture: with Arrow you can share the same memory directly across all these systems. Without Arrow, each system has its own internal memory representation and a lot of computation time is spent on serialization and deserialization.

So how does it look like to work with Python and Parquet? The tabular nature of Parquet makes it a good fit to read into Pandas or Dask dataframes with two libraries, fastparquet and PyArrow. fastparquet comes from the group which actually developed Dask, Matthew Rocklin and that group, and PyArrow comes from Wes McKinney's group, the Arrow group. Reading a Parquet file into a Pandas or Dask dataframe is very easy: there is a top-level function read_parquet and you pass it a file path. If you are interested in efficient reading of files, you should specify only the columns that you are interested in. For example, if I have three columns here and I am only interested in the location and the sale, I pass those columns, and data is read from disk for only those columns.
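As a minimal sketch of the read_parquet call with column projection just described; the file name and column names are hypothetical, and either pyarrow or fastparquet must be installed.

```python
import pandas as pd
import dask.dataframe as dd

# Column projection: only the listed columns are read from disk.
pdf = pd.read_parquet("sales.parquet", columns=["location", "sale"])

# The Dask call looks the same and returns a lazy dataframe;
# .compute() materializes the result.
ddf = dd.read_parquet("sales.parquet", columns=["location", "sale"])
total_per_location = ddf.groupby("location")["sale"].sum().compute()
```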
This is also called column projection. Similarly, you can write with a top-level function called to_parquet, so dataframes can be written in Parquet format, and if you look at the syntax for Pandas and Dask, it's actually the same. We'll talk about compression a little bit afterwards.

One other thing that you need to know if you care about reading large files efficiently is predicate pushdown. When you are reading a large file you want to read only the columns that you are interested in, but you also want to skip the rows that are not relevant to you. How does it help? It saves IO load, because data doesn't need to be transferred, and it saves CPU cycles, because data doesn't need to be decoded. To use predicate pushdown you just need to pass a filter on the row groups with the filter syntax, and it will only read the row groups in which data matching the predicate is present. Parquet supports predicate pushdown because when you write Parquet files, metadata and statistics like the min and max of the values are written as well. When you read a Parquet file this metadata is read first, and then a whole row group can be skipped if your predicate doesn't match it. However, to use this feature you have to make sure you write the data to disk in sorted order.

Another important trick to remember if you want to minimize storage size on disk and speed up reading later is dictionary encoding and filtering. This is a very good fit for Pandas categorical variables. For example, if you have categorical data and its values are always high, medium or low, you can use dictionary encoding. What you do here is take all the values, assign a code to each value, store this mapping of values and codes at the beginning of the row group, and for every value you only store the code. Now if you want to filter the data, you can check in the dictionary of a row group whether your predicate matches it, and then only that data is read. The performance penalty for adding dictionaries in this case is not that high, only about 1% on top of reading that data, and it's especially helpful if you have not sorted the data and you still want to benefit from partial reading.

Whenever you have distributed systems, or when you work with multiple customers that produce data, it makes sense to split the data into multiple files. Usually this is done with a Hive partitioning scheme, so as you can see in the above screenshot, we have a Parquet dataset which is partitioned by date and location, and then there are multiple Parquet files underneath. Dask can be used to create this type of structure, Dask can also auto-detect these structures, and you can then pass a predicate with predicate pushdown and it will read the data only from that particular file. You can also read data with Dask from various cloud storages; in this example we are on Azure, so you just need to pass the storage options and then you can directly read the files.

These are some benchmarks that we have obtained on the NYC Taxi data. The original file is a 2.54 GB uncompressed CSV, and if you write it to Parquet, even the uncompressed Parquet file is less than half the size of the CSV. One important thing to note is that this is without dictionary encoding; with dictionary encoding you can reduce the file size even more. Also, the difference in size becomes more significant as the size of the data increases. For example, I have seen benchmark results where the uncompressed CSV file is around 30 GB and the Parquet files are around 5 GB, so there are groups who have obtained that.
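Before the read benchmarks, here is a minimal sketch pulling together the pieces described above: to_parquet with snappy compression, a categorical column that gets dictionary-encoded, Hive-style partitioning, and a read back with a predicate-pushdown filter. File paths, column names, values and the Azure account details are hypothetical.

```python
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({
    "date": ["2019-07-01", "2019-07-01", "2019-07-02"],
    "location": ["berlin", "pune", "berlin"],
    "priority": pd.Categorical(["high", "low", "medium"]),  # dictionary-encoded on write
    "sale": [120.0, 80.5, 95.0],
})
ddf = dd.from_pandas(df, npartitions=1)

# Hive-style partitioning on date and location, snappy compression.
ddf.to_parquet("sales_parquet/", engine="pyarrow",
               compression="snappy", partition_on=["date", "location"])

# Predicate pushdown: only partitions / row groups matching the filter are read.
berlin = dd.read_parquet("sales_parquet/", engine="pyarrow",
                         filters=[("location", "==", "berlin")])

# Reading straight from cloud storage works the same way via storage_options,
# e.g. (hypothetical Azure account details):
# ddf = dd.read_parquet("abfs://container/sales_parquet/",
#                       storage_options={"account_name": "...", "account_key": "..."})
```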
These are the benchmarks for reading the data into a dataframe, and there are three different compression schemes that we compared; if you don't really care about which compression scheme to use, you can just go with Snappy.

So the usual data flow is: you run a query in the database, the result is passed in columnar form to the ODBC layer using pyodbc, the ODBC layer converts it into row form, and then Pandas or Dask makes it columnar again. Ideally, though, we don't want our ETL workflows to have direct access to the database, because it can be a bottleneck. One solution people use to bypass this ODBC layer is to export the data into CSV files; however, with CSV files you basically lose all the data type information. What has worked for us is to export the data from the database as Parquet files to cloud storage, and we run our machine learning models from there. There is a huge benefit of using cloud storage and Parquet files, because now we can really scale our models: cloud storage, for example Azure Data Lake, scales much better than SQL, and we don't have to synchronize our loading, because every job writes into separate Parquet files and we can work on those. Also, data in Parquet files can be shared by different implementations like Spark, and we can use query engines like Hive, Presto and Drill for analysis of the data. Thank you, I'll take questions.

Hi, my question is, when we are talking about Parquet, it is basically a Hadoop ecosystem thing, is it not? So if we have an existing cluster of, say, 100 or 150 nodes and we want to leverage Dask, do we have to install the Dask package on each node, or is only the edge node enough?

So Dask actually has its own clusters, and it depends on how you want to use Dask. Usually what we do initially, just for development, is use a simple local Dask cluster, and afterwards it depends on where you are running it; it scales accordingly.

Hi, that was a really good talk, thank you. I really got to know about things such as Apache Arrow and Parquet; Apache Arrow as such is not a very well talked about thing, the community has not supported it as much as it has supported Parquet. My question is, you compared Dask to Spark, and you presented Dask as a query engine as well as a compute or processing engine. I would like to know how the internals of Dask work, how it works at a scale such as MPP, massively parallel processing, how it parallel-processes files, and how deployment in Dask works, where to deploy Dask clusters, including the cloud options; you can deploy Spark clusters on AWS using MapReduce and so on. If you have experience with such things, can you summarize?

That is a broad topic, and I think we can probably talk during the break; for now I'll just talk about the parallel processing side of Dask.
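As a minimal sketch of the "simple local Dask cluster for development" mentioned in the first answer, assuming dask.distributed is installed; the worker counts and the scheduler address are hypothetical.

```python
from dask.distributed import Client

# A simple local cluster for development: worker processes on this machine.
client = Client(n_workers=4, threads_per_worker=2)
print(client)

# The same code can later point at a real distributed scheduler, e.g.:
# client = Client("tcp://scheduler-host:8786")
```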
Thank you for the talk, I'll definitely use this. About the optimization for the workflow you suggested: rather than using your ODBC connection to, say, an Oracle database, you put the data into the cloud, but that increases the cost, and most organizations would not like to do that due to privacy and cost. Do you have any suggestion at the ODBC layer level itself? You suggested CSV, but then again we lose the type information. Is there any other technique that you tried that works slightly better than just an ODBC connection while still leveraging Dask and Arrow?

So actually, the cost of using cloud storage is not that much; it's only when you extract data that you are actually charged, so it's not that much. And to answer your question, there is an open source package called simplekv which actually addresses this problem of how to extract data with pyodbc; you can check out simplekv.

Hi, how can we compare Parquet to HDF5, and what are the relative advantages of each?

So there are some benchmarks actually comparing Parquet and HDF5. The file size with HDF5 is reduced a lot; if I had put it into the benchmark it would have been smaller than the Parquet files. But if you try to read data from it, it takes longer than from the Parquet file, and then again you don't get these features like reading only the data that you are interested in, because it's not a columnar storage format.

Hi, suppose I have one single CSV which is huge, gigabytes in size. Is Dask applicable for that, and how do I read it in memory? It's not a cluster, suppose I have a single node, so are Apache Arrow and Dask suitable in this case?

Yeah, exactly. Basically what you can do is read that CSV and write it as Parquet, and as you can see, the file size will reduce. If you know the data, then while writing it you encode the dictionary for the data, so all the categorical variables will be encoded, and that will reduce the file size even more. Then you can basically read whatever part of the data you want (there is a short sketch of this after the Q&A).

Thank you. My question is, in Cassandra or similar we import into the database with CSV files, which is pretty much straightforward, but Parquet is somewhat encoded with all these things, so the write speed may drop when we import at those places. How do we overcome that?

The write speed into Cassandra or some other database while importing through files? So right now you are writing CSV files into Cassandra, and if you use Parquet files instead, then while reading the files into Cassandra I don't think there should be a problem. What happens is that, the way Parquet files are written, as I explained, the file statistics and everything are encoded at write time, so that's why it actually takes some time to write them, and that's also the reason Parquet is not used for streaming data. But it's a very good fit if you don't have streaming data and you are interested in reading only some chunk of the data each time.
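A minimal sketch of the CSV-to-Parquet conversion described in the answer above about the single huge CSV; the file names, the blocksize and the column names are hypothetical.

```python
import dask.dataframe as dd

# Read the big CSV in ~256 MB chunks; declare known categoricals so they get
# dictionary encoding when written ("priority" is a hypothetical column name).
ddf = dd.read_csv("huge_file.csv", blocksize="256MB",
                  dtype={"priority": "category"})

# One-time conversion to Parquet.
ddf.to_parquet("huge_file_parquet/", engine="pyarrow", compression="snappy")

# Later reads only touch the columns and row groups you ask for.
subset = dd.read_parquet("huge_file_parquet/",
                         columns=["priority", "sale"],
                         filters=[("priority", "==", "high")])
```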