PySpark: Python API for Spark





Published on Mar 2, 2013

UC Berkeley AMPLab member Josh Rosen presents PySpark, the new Python API for Spark, available in release 0.7. This presentation was given at the Spark meetup at Conviva in San Mateo, CA, on Feb 21, 2013. Download here: http://spark-project.org/downloads/

00:33 What is Spark?
03:00 What is PySpark?
03:45 Example Word Count
04:35 Demonstration of interactive shell on AWS EC2
06:22 Tracking elapsed time, %time berkeley_pages.count()
06:37 Spark web interface
09:14 Distributing data, sc.parallelize
11:20 API documentation
11:27 Python doctest, create tests from interactive samples
11:58 Example kmeans.py, k-means clustering
12:39 Getting help, help(sc)
13:00 Example wordcount.py
13:18 PySpark Implementation details
14:15 PySpark is less than 2K lines, including comments
17:18 Pickled Objects, RDD[Array[Byte]]
17:44 Batching Pickle to reduce overhead
18:00 Consolidating operations into single pass when possible
19:27 PySpark roadmap: adding sorting support, file formats such as CSV, PyPy JIT
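
The "Pickled Objects" and "Batching Pickle" items in the outline above can be illustrated with a small standard-library sketch. This is not PySpark's actual code: the helper names dump_batched and load_batched, the batch size, and the sample records are all hypothetical, chosen only to show the general idea of pickling several records into one byte string so that fewer serialized objects have to be handed between processes.

```python
import pickle
from itertools import islice


def dump_batched(items, batch_size):
    """Yield one pickled byte string per group of up to batch_size items.

    Hypothetical helper for illustration; not part of the PySpark API.
    """
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield pickle.dumps(batch, protocol=2)


def load_batched(blobs):
    """Unpickle each byte string and yield the original items in order."""
    for blob in blobs:
        for item in pickle.loads(blob):
            yield item


# Made-up sample records standing in for (word, count) pairs.
records = [("word%d" % i, i) for i in range(10)]

# Unbatched: one pickled byte string per record.
one_per_record = [pickle.dumps(r, protocol=2) for r in records]

# Batched: up to four records share a single pickled byte string.
batched = list(dump_batched(records, batch_size=4))

restored = list(load_batched(batched))
```

Here ten records become three byte strings instead of ten, and load_batched recovers the records in their original order. The batch size is a tuning choice: larger batches mean fewer serialized objects but more data held in memory at once.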

