Anne Matthies - Zero-Administration Data Pipelines using AWS Simple Workflow

596 views

Published on Jun 1, 2016

PyData Berlin 2016

Floto is an open source tool to programmatically author, schedule and run scalable data pipelines using AWS Simple Workflow - without the need to maintain a master server, a queue, or the state of workers.

There are quite a few great tools for building effective and robust distributed data processing pipelines, especially Luigi from Spotify and Airflow from Airbnb.

To scale out, though, they all require a queue or a master server, and those need maintenance.

We wrote floto (github.com/babbel/floto), an open source tool to programmatically author, schedule and run scalable data pipelines on AWS - without the maintenance overhead.

It uses AWS Simple Workflow, but I'll mostly talk about some general topics regarding data workflow orchestration:
- separation of concerns
- managing complexity through dependency reduction
- idempotent (or re-runnable) jobs
- transactional jobs (either completely fail, or completely succeed)
- failures and reruns
- evolving changes
- organizational scaling
- heterogeneous systems
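
To make two of the topics above concrete (idempotent jobs and transactional jobs), here is a minimal sketch in plain Python. It is not floto's API and does not use AWS Simple Workflow; the function name `run_job` and its parameters `output_path` and `compute` are hypothetical, chosen only for illustration. It assumes a job whose result is a single text file on a local filesystem.

```python
import os
import tempfile


def run_job(output_path, compute):
    """Publish the result of compute() at output_path (hypothetical helper).

    Returns True if the job ran, False if its output already existed.
    """
    # Idempotent: if an earlier run already published the output, a rerun
    # changes nothing.
    if os.path.exists(output_path):
        return False

    # Transactional: write to a temporary file in the same directory, so a
    # failure never leaves a partial file at output_path.
    directory = os.path.dirname(os.path.abspath(output_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(compute())
        # os.replace renames atomically when both paths are on the same
        # filesystem, so the output appears complete or not at all.
        os.replace(tmp_path, output_path)
    except BaseException:
        # Remove the partial temporary file, then re-raise the failure.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return True
```

With this shape, a failed run can simply be started again: it either finds a complete output and skips the work, or finds nothing and starts from scratch.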
