"Expressing complex data aggregations with Histogrammar" by Jim Pivarski





Published on Sep 17, 2016

Since the 1970s, data analysis in high energy physics has revolved around the histogram: an array of integers approximating a distribution. Fortran codes drawing ASCII-art plots bear a strong resemblance to the analysis scripts that discovered the Higgs boson: imperative for-loops filling histogram objects.

If this sounds cumbersome, it is. Explicit for-loops must be manually edited to add concurrency. However, the histogram concept itself is powerful: many data visualizations can be constructed by cleverly filling suites of related histograms, adding them, subtracting them, and dividing them bin by bin.
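To make the bin-by-bin arithmetic concrete, here is a minimal sketch in plain Python. It does not use Histogrammar; the `histogram` helper, the toy `events` list, and the selection flag are all assumptions made up for illustration. It fills two related histograms in the imperative style described above, then adds, subtracts, and divides them bin by bin to get a selection efficiency per bin.

```python
def histogram(values, nbins, low, high):
    """Count how many values land in each of nbins equal-width bins on [low, high)."""
    counts = [0] * nbins
    width = (high - low) / nbins
    for v in values:  # the explicit for-loop of a traditional analysis script
        if low <= v < high:
            index = min(int((v - low) / width), nbins - 1)  # guard against rounding at the top edge
            counts[index] += 1
    return counts

# Hypothetical data: (measured quantity, passed the selection?) for each event.
events = [(0.5, True), (1.5, True), (1.7, False), (2.2, True), (2.9, False), (3.1, False)]

total = histogram([x for x, _ in events], nbins=4, low=0.0, high=4.0)
passed = histogram([x for x, ok in events if ok], nbins=4, low=0.0, high=4.0)

# Bin-by-bin arithmetic on histograms that share the same binning.
summed = [t + p for t, p in zip(total, passed)]
failed = [t - p for t, p in zip(total, passed)]
efficiency = [p / t if t else None for p, t in zip(passed, total)]

print(total)       # [1, 2, 2, 1]
print(passed)      # [1, 1, 1, 0]
print(failed)      # [0, 1, 1, 1]
print(efficiency)  # [1.0, 0.5, 0.5, 0.0]
```

The arithmetic only makes sense because both histograms were filled with identical bin edges, which is why such plots are usually built from suites of related histograms.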

Today, high-energy physics analysis is colliding with tools from the Big Data community. In my work with physicists adopting Apache Spark, I've found that histograms can benefit from a functional style, accepting fill rules as lambda functions, and they can be subdivided into more fundamental units.
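As a rough illustration of the functional style, the sketch below splits a histogram into two smaller units: a `Count` that tallies entries and a `Bin` that routes each datum to a sub-aggregator chosen by a lambda. The class names and signatures here are hypothetical, written for this example only, and should not be read as Histogrammar's actual API; the muon records are invented toy data.

```python
class Count:
    """Smallest unit: a running sum of weights."""
    def __init__(self):
        self.entries = 0.0

    def fill(self, datum, weight=1.0):
        self.entries += weight


class Bin:
    """Splits [low, high) into num bins and forwards each datum to one sub-aggregator."""
    def __init__(self, num, low, high, quantity, make_value=Count):
        self.num, self.low, self.high = num, low, high
        self.quantity = quantity  # fill rule: a function from datum to number
        self.values = [make_value() for _ in range(num)]

    def fill(self, datum, weight=1.0):
        q = self.quantity(datum)
        if self.low <= q < self.high:
            index = int((q - self.low) / (self.high - self.low) * self.num)
            self.values[min(index, self.num - 1)].fill(datum, weight)


# Hypothetical records standing in for physics objects.
muons = [{"pt": 12.0, "eta": 0.3}, {"pt": 48.0, "eta": -1.1}, {"pt": 51.0, "eta": 2.0}]

# A one-dimensional histogram: the fill rule is passed in as a lambda.
hist = Bin(4, 0.0, 100.0, lambda mu: mu["pt"])
for mu in muons:
    hist.fill(mu)
counts = [v.entries for v in hist.values]
print(counts)  # [1.0, 1.0, 1.0, 0.0]

# A two-dimensional histogram is a Bin whose sub-aggregators are themselves Bins.
hist2d = Bin(2, 0.0, 100.0, lambda mu: mu["pt"],
             make_value=lambda: Bin(2, -3.0, 3.0, lambda mu: mu["eta"]))
for mu in muons:
    hist2d.fill(mu)
grid = [[cell.entries for cell in row.values] for row in hist2d.values]
print(grid)  # [[1.0, 1.0], [0.0, 1.0]]
```

Because the sub-aggregator is a parameter, swapping `Count` for some other unit changes what each bin accumulates without touching the binning logic.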

In fact, all the tricks for building complex data visualizations with histograms can be formalized as a grammar of "aggregator monoids." These aggregators are associative for easy concurrency and simple enough to implement in many languages.
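The monoid idea can be sketched in a few lines of plain Python. This is an assumption-laden toy, not Histogrammar's implementation: the `BinnedCount` class, its `zero` method, and the `partitions` list are hypothetical names invented here. The point is that an aggregator with an associative `+` and an empty (identity) element can be filled independently on each partition of the data and merged in any grouping.

```python
from functools import reduce


class BinnedCount:
    """A histogram with an associative '+' and an empty element, i.e. a monoid."""
    def __init__(self, num, low, high, counts=None):
        self.num, self.low, self.high = num, low, high
        self.counts = list(counts) if counts is not None else [0] * num

    def zero(self):
        # Identity element: same binning, no entries.
        return BinnedCount(self.num, self.low, self.high)

    def fill(self, x):
        if self.low <= x < self.high:
            index = int((x - self.low) / (self.high - self.low) * self.num)
            self.counts[min(index, self.num - 1)] += 1

    def __add__(self, other):
        # Combining is bin-by-bin addition; it returns a new object.
        merged = [a + b for a, b in zip(self.counts, other.counts)]
        return BinnedCount(self.num, self.low, self.high, merged)


def aggregate(template, partition):
    """Fill a fresh copy of the template from one partition of the data."""
    h = template.zero()
    for x in partition:
        h.fill(x)
    return h


template = BinnedCount(4, 0.0, 4.0)

# Hypothetical partitions, as if the data were spread over three workers.
partitions = [[1.0, 2.5], [2.7, 3.9, 0.1], [1.1]]
a, b, c = (aggregate(template, p) for p in partitions)

# Associativity: the grouping of the merges does not matter.
left = ((a + b) + c).counts
right = (a + (b + c)).counts

# Merging partial results matches a single pass over all the data.
combined = reduce(lambda x, y: x + y, [a, b, c], template.zero()).counts
single_pass = aggregate(template, [x for p in partitions for x in p]).counts

print(left, right, combined, single_pass)  # each is [1, 2, 2, 1]
```

The same fill-then-merge shape is what a cluster framework needs from an aggregation: one function to update a partial result with a datum, and one to combine two partial results.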

In this talk, I'll show you how to use Histogrammar, a lightweight, cross-language suite of histogramming primitives. Examples will include cluster-wide histogramming in Spark and tapping intermediate values in a GPU computation. It is my belief that the Big Data community can learn as much from physicists as physicists are learning from Big Data.

