Published on May 30, 2015
Apache Spark is a computational engine for large-scale data processing. It is responsible for scheduling, distributing, and monitoring applications that consist of many computational tasks across many worker machines on a computing cluster. This talk will give an overview of the PySpark DataFrame API. While Spark core itself is written in Scala and runs on the JVM, PySpark exposes the Spark programming model to Python. The Spark DataFrame API was introduced in Spark 1.3. DataFrames evolve Spark's Resilient Distributed Datasets model and are inspired by Pandas and R data frames. The API provides simplified operators for filtering, aggregating, and projecting over large datasets. The DataFrame API supports different data sources such as JSON files, Parquet files, Hive tables, and JDBC database connections.