Published on Jun 1, 2016
PyData Berlin 2016
Floto is an open source tool to programmatically author, schedule and run scalable data pipelines using AWS Simple Workflow - without the need to maintain a master server or queue or the state of workers.
There are quite a few great tools for building effective and robust distributed data processing pipelines, especially Luigi from Spotify and Airflow from Airbnb.
For scaling out, they all require a queue or master server, though. And those need maintenance.
We wrote floto (github.com/babbel/floto), an open source tool to programmatically author, schedule and run scalable data pipelines on AWS - without the maintenance overhead.
It uses AWS Simple Workflow, but I'll mostly talk about some general topics regarding data workflow orchestration:
- separation of concerns
- managing complexity through dependency reduction
- idempotent (or re-runnable) jobs
- transactional jobs (either completely fail, or completely succeed)
- failures and reruns
- evolving changes
- organizational scaling
- heterogeneous systems
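
To make two of the topics above concrete, here is a minimal sketch of a job that is both idempotent and transactional. It is not floto code and does not use the floto or AWS Simple Workflow APIs; the function name `run_job`, the file paths, and the upper-casing "work" are assumptions chosen only for illustration. The idea is that a rerun finds the finished output and does nothing, while a failed run leaves no partial output behind.

```python
import os
import tempfile


def run_job(input_path: str, output_path: str) -> bool:
    """Hypothetical job: copy a text file, upper-casing every line.

    Returns True if work was done, False if the output already existed.
    """
    # Idempotent: if a previous run already produced the output, do nothing.
    if os.path.exists(output_path):
        return False

    out_dir = os.path.dirname(os.path.abspath(output_path))
    # Write to a temporary file in the same directory, so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=out_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp, open(input_path) as src:
            for line in src:
                tmp.write(line.upper())
        # Transactional: the output only appears under its final name once
        # it is complete (os.replace is an atomic rename on POSIX systems).
        os.replace(tmp_path, output_path)
    except BaseException:
        # On any failure, remove the partial temporary file and re-raise.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return True
```

With this shape, handling a failure is simply a matter of running the job again: a scheduler does not need to know whether the previous attempt crashed halfway or finished, because the presence of the output file is the only state that matters.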