Stephen Simmons - Pandas from the Inside / "Big Pandas"





Published on Jul 26, 2017

Pandas is great for data analysis in Python. It promises intuitive DataFrames from R; speed like numpy; groupby like SQL. But there are plenty of pitfalls. This tutorial looks inside pandas to see how DataFrames actually work when building, indexing and grouping tables. You will learn how to write fast, efficient code, and how to scale up to bigger problems with libraries like Dask.

Pandas is a great way to get started quickly with data analysis in Python: intuitive DataFrames from R; fast numpy arrays under the hood; groupby just like SQL. But this familiarity is deceptive, and both new and experienced pandas users often get stuck on things they feel should be simple.

In the first part of this tutorial, we look inside pandas to see how DataFrames actually work when building, indexing and grouping tables. We will learn which pandas operations are fast and why, and how to avoid common performance pitfalls. By the end of the tutorial, you will have developed a strong and reliable intuition for using pandas effectively.
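As a rough illustration of the kind of pitfall meant here (a minimal sketch, not taken from the tutorial itself), the snippet below computes the same column twice: once with a Python-level loop over rows and once with a single vectorized expression, then aggregates it SQL-style with groupby. The table and its column names (`group`, `price`, `qty`) are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample table; the column names are made up for illustration.
df = pd.DataFrame({
    "group": ["a", "b", "a", "c", "b", "a"],
    "price": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "qty": [10, 20, 30, 40, 50, 60],
})

# Slow pattern: a Python-level loop that builds one Series object per row.
slow_total = [row["price"] * row["qty"] for _, row in df.iterrows()]

# Fast pattern: one vectorized operation on whole columns,
# which runs on the underlying numpy arrays.
fast_total = df["price"] * df["qty"]

# Both approaches give the same numbers; only the cost differs as the table grows.
assert np.allclose(slow_total, fast_total.to_numpy())

# SQL-style aggregation, roughly: SELECT "group", SUM(total) ... GROUP BY "group"
df["total"] = fast_total
totals_by_group = df.groupby("group")["total"].sum()
print(totals_by_group)
```

On a six-row table the difference is invisible, but timing both versions on a few hundred thousand rows (for example with `%timeit` in a Jupyter notebook) should make the gap obvious.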

In the second part, we switch gears to bigger problems where our data sets can't fit in local memory. First we see how pandas behaves as we start to hit memory limits. Then we look at Dask, whose distributed/deferred DataFrames are a near drop-in replacement for pandas. Then we come back to pure pandas and look for ways to manage bigger data sets with clever data storage.
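One plausible reading of "clever data storage" in pure pandas is choosing more compact column dtypes. The sketch below is an assumption about the sort of technique involved, not the tutorial's own example: it measures a DataFrame's memory footprint, then converts a repetitive string column to a categorical and downcasts an integer column. The data and column names (`city`, `value`) are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a string column with only four distinct values,
# plus an integer column that defaults to 64 bits.
n = 100_000
cities = np.array(["London", "Paris", "Berlin", "Madrid"])
df = pd.DataFrame({
    "city": cities[np.arange(n) % len(cities)],
    "value": np.arange(n, dtype="int64"),
})

# deep=True also counts the memory held by the Python string objects.
before = df.memory_usage(deep=True)

# Store each distinct string once and keep small integer codes per row.
df["city"] = df["city"].astype("category")

# Use the smallest signed integer type that can hold every value.
df["value"] = pd.to_numeric(df["value"], downcast="integer")

after = df.memory_usage(deep=True)

print("bytes before:", int(before.sum()))
print("bytes after: ", int(after.sum()))
```

The exact savings depend on the pandas version and the data, so the byte counts are printed rather than quoted here. Categoricals help most when a column has few distinct values relative to its length.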

During this tutorial, you are welcome to follow along on your laptop with the sample data sets and example code in a Jupyter notebook. These will be made available on GitHub here just before the tutorial. The code targets Python 3 and the latest pandas/dask release:


PyData is an educational program of NumFOCUS, a 501(c)(3) non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R.

PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.
