Peter Hoffmann: Introduction to the PySpark DataFrame API




Published on May 30, 2015

Apache Spark is a computational engine for large-scale data processing. It is responsible for scheduling, distributing, and monitoring applications that consist of many computational tasks across many worker machines on a computing cluster. This talk will give an overview of the PySpark DataFrame API. While Spark core itself is written in Scala and runs on the JVM, PySpark exposes the Spark programming model to Python. The Spark DataFrame API was introduced in Spark 1.3. DataFrames evolve Spark's Resilient Distributed Datasets (RDD) model and are inspired by Pandas and R data frames. The API provides simplified operators for filtering, aggregating, and projecting over large datasets. The DataFrame API supports different data sources such as JSON files, Parquet files, Hive tables, and JDBC database connections.

Peter Hoffmann
