Jonathan Dinu: Scalable Pipelines with Luigi or: I’ll have the Data Engineering, hold the Java!

10,532 views

Published on Aug 5, 2015

PyData Seattle 2015
In this workshop you will see how (and why) to leverage the PyData ecosystem to build a robust data pipeline. More specifically, you will learn how to use the Luigi framework to integrate the stages of a model-building pipeline (collection, processing, vectorization, training of multiple models, and validation), all in Python!

As companies scale prototypes and ad hoc analyses into production systems, it is critical to build automated (and repeatable) systems for data collection/processing and model training/evaluation that are fault tolerant enough to adapt to changing constraints. Sustainable software development is often an afterthought for data scientists, especially since the tools for analysis (R, scientific Python, etc.) do not naturally lend themselves to building scalable and extensible software abstractions. But now we can have our cake and eat it too... all with Python!

Outline:
The basic components of a data pipeline (5min)
What and Why Luigi (10min)
Lab: The Smallest (1 stage) pipeline (15min)
Managing dependencies in a pipeline (10min)
Lab: Multi-stage pipeline and introduction to the Luigi Visualizer (15min)
Serialization in a Data Pipeline (10min)
Lab: Integrating your pipeline with HDFS and Postgres (20min)
Scheduling (10min)
Lab: Parallelism and recurring jobs with Luigi (20min)
Wrap up and next steps (5min)
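
To make the "managing dependencies" and "multi-stage pipeline" items above concrete, here is a minimal standard-library sketch of the pattern Luigi tasks follow: each task declares what it requires, what output it produces, and how to run, and a task whose output already exists is skipped. This is not Luigi's API and not the workshop's lab code. The `Task` base class, the `build` helper, and the `Collect` and `CountWords` stages are hypothetical names invented for illustration. In Luigi itself the corresponding pieces are `luigi.Task` and target classes such as `luigi.LocalTarget`.

```python
import os
import tempfile


class Task:
    """Hypothetical base class mirroring the requires/output/run pattern."""

    def requires(self):
        return []  # upstream tasks; none by default

    def output(self):
        raise NotImplementedError  # path of the file this task produces

    def run(self):
        raise NotImplementedError

    def complete(self):
        # A task counts as done once its output file exists.
        return os.path.exists(self.output())


def build(task, log=None):
    """Run upstream tasks depth-first, skipping tasks that are already complete."""
    log = [] if log is None else log
    if task.complete():
        return log
    for dep in task.requires():
        build(dep, log)
    task.run()
    log.append(type(task).__name__)  # record execution order
    return log


WORKDIR = tempfile.mkdtemp()  # scratch directory for this sketch


class Collect(Task):
    """Stage 1: write some raw text (a stand-in for data collection)."""

    def output(self):
        return os.path.join(WORKDIR, "raw.txt")

    def run(self):
        with open(self.output(), "w") as f:
            f.write("luigi makes pipelines\npipelines in python\n")


class CountWords(Task):
    """Stage 2: count words in the output of Collect."""

    def requires(self):
        return [Collect()]

    def output(self):
        return os.path.join(WORKDIR, "counts.txt")

    def run(self):
        counts = {}
        with open(self.requires()[0].output()) as f:
            for word in f.read().split():
                counts[word] = counts.get(word, 0) + 1
        with open(self.output(), "w") as f:
            for word in sorted(counts):
                f.write(f"{word}\t{counts[word]}\n")


first_run = build(CountWords())   # runs Collect, then CountWords
second_run = build(CountWords())  # outputs already exist, so nothing runs
```

Using file existence as the completeness check is what makes a rerun cheap and repeatable: only stages with missing outputs execute. The sketch leaves out atomic writes, task parameters, scheduling, and parallelism.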

Materials available here:
Github Repo: https://github.com/Jay-Oh-eN/data-eng...
Slides: http://www.slideshare.net/jonathandin...
