An efficient workflow for reproducible science; SciPy 2013 Presentation





Published on Jul 1, 2013

Authors: Bekolay, Trevor, University of Waterloo

Track: Reproducible Science

Every scientist should be able to regenerate the figures in a paper. However, all too often the correct version of a script goes missing, or the original raw data is filtered by hand and the filtering process is left undocumented, or the student who has the data or code has switched labs.

In this talk, I will describe a workflow for a complete end-to-end analysis pipeline, going from raw data to analysis to plotting, using existing tools to make each step of the pipeline reproducible, documented, and efficient, while requiring few sacrifices in terms of a scientist's time and effort.

The key insight is to decouple each analysis step and each plotting step, so that several analyses or plots can run in parallel. Each step can be cached if it is costly, with the code that produces the cached data serving as the documentation for how it is produced.
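
As a rough illustration of the caching idea, the sketch below wraps each step in a decorator that stores its result on disk and reuses it on later calls. It is not the tooling used in the talk: the names `cached_step`, `filter_raw`, `mean_of`, `CACHE_DIR`, and `call_counts` are hypothetical, the sample data is made up, and the cache key is simply the step name plus its arguments.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path

# Hypothetical cache location. A fresh temporary directory keeps this sketch
# self-contained; a real project would use a fixed directory beside the data.
CACHE_DIR = Path(tempfile.mkdtemp(prefix="analysis_cache_"))

# Counts how often each step actually executes (for illustration only).
call_counts = {}


def cached_step(func):
    """Store a step's result on disk, keyed by the step name and arguments."""
    def wrapper(*args):
        key = hashlib.sha1(repr((func.__name__, args)).encode()).hexdigest()
        path = CACHE_DIR / f"{func.__name__}-{key}.pkl"
        if path.exists():
            # Cache hit: load the stored result instead of recomputing.
            with path.open("rb") as f:
                return pickle.load(f)
        call_counts[func.__name__] = call_counts.get(func.__name__, 0) + 1
        result = func(*args)
        with path.open("wb") as f:
            pickle.dump(result, f)
        return result
    return wrapper


@cached_step
def filter_raw(samples, threshold):
    """Drop samples above a threshold; the code itself documents the filter."""
    return [s for s in samples if s <= threshold]


@cached_step
def mean_of(samples):
    """A stand-in for a costly analysis step."""
    return sum(samples) / len(samples)


raw = (1.0, 2.0, 3.0, 100.0)  # made-up raw data
first = mean_of(tuple(filter_raw(raw, 10.0)))
second = mean_of(tuple(filter_raw(raw, 10.0)))  # both steps come from the cache
```

Because the filtering lives in `filter_raw` rather than in a manual edit of the data, the cached file can always be deleted and rebuilt from the raw input.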

I will discuss a way to organize code in order to make analyzing and plotting large data sets efficient, parallelizable, and cacheable. Once completed, source code can be uploaded to a hosting service like GitHub or Bitbucket, and data can be uploaded to a data store like Amazon S3 or figshare. The end result is that readers can completely regenerate the figures in your paper at little or no cost to you.
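
One possible way to organize decoupled steps so they can run side by side is sketched below, assuming each analysis is a plain function of the loaded data. The registry `ANALYSES`, the functions `load_raw`, `analyze_mean`, `analyze_range`, and `run_all`, and the sample values are all hypothetical. A thread pool from the standard library is used here for simplicity; a process pool may be a better fit for CPU-bound analyses.

```python
from concurrent.futures import ThreadPoolExecutor


def load_raw():
    """Stand-in for reading raw data from disk (values are made up)."""
    return {"control": [1.0, 2.0, 3.0], "treated": [2.0, 4.0, 6.0]}


def analyze_mean(data):
    """One independent analysis: the mean of each condition."""
    return {name: sum(vals) / len(vals) for name, vals in data.items()}


def analyze_range(data):
    """Another independent analysis: the spread of each condition."""
    return {name: max(vals) - min(vals) for name, vals in data.items()}


# Each entry depends only on the raw data, never on another entry,
# so the entries can be submitted to the pool in any order.
ANALYSES = {"mean": analyze_mean, "range": analyze_range}


def run_all(data):
    """Submit every registered analysis and collect results by name."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, data) for name, fn in ANALYSES.items()}
        return {name: fut.result() for name, fut in futures.items()}


results = run_all(load_raw())
```

Adding a new analysis or plot then means adding one function and one registry entry, without touching the existing steps.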

