Demo of DataPad visual analytics application showcasing improved performance with columnar data structures.

Ten Reasons Why I Grew to “Hate” Pandas (and What We’re Doing About It)

This article is the first in a series exploring Apache Arrow, pandas, pandas2, and the ongoing evolution of my work in data science. It’s a detailed and technical piece, but if you’re interested in the future of data analysis in Python, I encourage you to delve into it.

Here, I aim to clearly explain some of the critical challenges within pandas’s internal workings and how I’ve been systematically planning and developing practical solutions. The projects I’ve been involved in – pandas, Badger, Ibis, Arrow, Feather, Parquet – might seem unrelated at first glance. However, they are deeply interconnected, forming a continuous line of work that began nearly a decade ago.

Consider supporting pandas development with a tax-deductible donation.

Background: My “Love-Hate” Relationship with Pandas

I started building pandas in April 2008 as a side project, working on it during nights and weekends. Back then, my software engineering skills and understanding of Python’s scientific computing ecosystem were limited. My initial code was, to put it mildly, not elegant or efficient. I learned through trial and error, absorbing knowledge from others as much as possible. My serious foray into C development began in 2013, followed by C++ in 2015. Today, I have a much greater appreciation for C++ than I did nine years prior.

In 2008, Python was not the data science powerhouse it is today. The problems pandas solves so effectively now were typically addressed using tools like R, SAS, SPSS, Stata, or MATLAB. Python was not the go-to choice for data analysis at that scale.

Therefore, it’s perhaps not surprising that pandas’s internal architecture has accumulated some imperfections over time. In the summer of 2011, I created the BlockManager, a memory management system using NumPy arrays to handle the internal data columns of a pandas.DataFrame. I even wrote about it back in July 2011, outlining a roadmap for enhanced scientific data structures in Python.

While the BlockManager and pandas’s close integration with NumPy have been beneficial to the project’s history, they are now significant sources of problems for pandas users dealing with increasingly large datasets.

Simply put, in 2011 we weren’t anticipating the need to analyze datasets of 100 GB or 1 TB. Today, my rule of thumb for pandas is to have 5 to 10 times as much RAM as the size of your dataset. So for a 10 GB dataset, you ideally want 64 GB, preferably 128 GB, of RAM to ensure smooth operation and avoid memory issues. This often surprises users, who expect to be able to work with datasets only slightly smaller than their computer’s RAM.

Pandas Rule of Thumb: Allocate 5 to 10 times more RAM than your dataset size.

Furthermore, there are hidden memory consumers within the project, such as the use of Python objects (like strings) for various internal operations. This can lead to a 5 GB dataset on disk expanding to 20 GB or more in memory. This situation is far from ideal when dealing with large datasets.
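
To see this effect directly, compare pandas’s shallow and deep memory accounting on a DataFrame that holds a column of Python string objects. A minimal sketch (the column names are just for illustration):

```python
import numpy as np
import pandas as pd

n = 1_000_000
df = pd.DataFrame({
    "ints": np.arange(n, dtype=np.int64),
    "strs": pd.Series(np.arange(n)).astype(str),
})

# The shallow measurement only counts the array of PyObject pointers for
# the string column; deep=True follows each Python object on the heap.
print(df.memory_usage())            # both columns appear to be ~8 MB
print(df.memory_usage(deep=True))   # the object column is several times larger
```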

DataPad, Badger, and Lessons from Cloudera: Seeds of “Hate”

In 2013, I co-founded DataPad with Chang She, a longtime friend and pandas collaborator. Our goal was to leverage the emerging PyData stack to power our visual analytics application. However, we encountered significant performance bottlenecks, particularly in cloud environments. The responsiveness of analytics queries in DataPad was not satisfactory with standard pandas configurations.

To address this, I streamlined the pandas feature set to the absolute essentials and developed a new, compact implementation called Badger. I discovered that by utilizing contiguous, immutable columnar data structures optimized for data locality, performance could be improved by 2 to 20 times across various operations. String processing saw the most substantial gains, but improvements were widespread. You can see a demo of DataPad here.

Badger was undeniably “startup code.” When Cloudera acquired us in 2014, I considered open-sourcing Badger, but cleaning up the code (mostly C with a heavy dose of macros) for public consumption would have been a substantial undertaking, and my focus had shifted to building a more future-proof solution that would still be relevant a decade later. Releasing Badger as it stood would have been a distraction for pandas users, and I wasn’t inclined to keep developing that particular codebase; abandoning a released codebase is generally bad practice. Given the need for a complete rewrite, Badger was shelved.

In November 2013, I delivered a talk titled 10 Things I Hate About Pandas, which has garnered nearly 100,000 views over four years. It summarized the lessons and “battle scars” from the project’s first five years of development, through 2013.

The 10 (actually 11) points, paraphrased, are:

  1. Internals too far removed from “the metal.”
  2. No support for memory-mapped datasets.
  3. Poor performance in database and file ingest/export.
  4. Problematic missing data support.
  5. Lack of transparency in memory usage and RAM management.
  6. Weak support for categorical data.
  7. Complex groupby operations are awkward and slow.
  8. Appending data to a DataFrame is tedious and very costly.
  9. Limited, non-extensible type metadata.
  10. Eager evaluation model, no query planning.
  11. “Slow,” limited multicore algorithms for large datasets.

Badger had started to address some of these issues, but its solutions were narrowly focused on DataPad’s specific needs. My move to Cloudera provided an opportunity to learn from database and big data system developers.

At Cloudera, I examined systems like Impala, Kudu, Spark, and Parquet. Since Python and pandas had not been involved in their development, integrating with them was challenging. The biggest hurdle was data interchange: moving large tabular datasets between the memory spaces of different processes was incredibly inefficient, and no standard solution existed. RPC-oriented serialization protocols like Thrift and Protocol Buffers were too slow and overly general.

Analyzing the integration points between these systems, I noticed similarities to the problems I had tackled in Badger. Zero-copy data access was paramount; the ability to memory-map complex tables was crucial for accessing terabyte-scale datasets as quickly and easily as megabyte-scale datasets.

By early 2015, I was advocating for a “columnar data middleware” offering zero-copy access, robust support for strings, nested types, and the complex, JSON-like data prevalent in real-world scenarios. Similar to Badger, this format needed to be optimized for data locality to maximize query evaluation speeds.

I was fortunate to connect with like-minded individuals across various big data projects, particularly from Apache Drill, Impala, Kudu, Spark, and others. In late 2015, to establish a neutral, vendor-agnostic platform for collaboration, we partnered with the Apache Software Foundation to launch Apache Arrow.

On paper, Apache Arrow embodied everything I had envisioned for years. However, in late 2015, all that existed (for Python) were Markdown specification documents, still under development. We initiated the Apache project to foster community dialogue about the specifications and the problems Arrow aimed to solve. We needed to build functional software to realize its potential. After nearly two years of dedicated work, we’ve made significant strides in achieving our initial goals.

I firmly believe that Arrow is a cornerstone technology for the next generation of data science tools. I recently outlined this vision in my JupyterCon keynote.

Also in late 2015, I authored a comprehensive set of design documents to initiate discussions about building a faster, cleaner core pandas implementation, potentially named pandas2. Pandas is a community project governed by consensus, with me as the BDFL to resolve deadlocks. I wanted to gauge the core developers’ agreement on my assessment of pandas’s internal shortcomings. Over the past two years, there’s been broad consensus on the issues, but finding solutions that don’t disrupt the existing pandas user base remains an ongoing challenge. My focus has been on building computational infrastructure, largely unseen by pandas users, that will pave the way for these improvements.

Does Arrow Alleviate My “Hate”? Addressing the “10 Things”

Arrow doesn’t completely resolve all “10 things” yet, but it has made substantial progress.

Arrow’s C++ implementation provides critical in-memory analytics infrastructure for projects like pandas:

  • A runtime column-oriented memory format optimized for analytical processing performance.
  • A zero-copy, streaming/chunk-oriented data layer designed for high-speed data movement and access.
  • Extensible type metadata to describe diverse flat and nested data types found in real-world systems, including support for user-defined types.

Currently, the Arrow C++ project is still developing:

  • A comprehensive analytical function “kernel” library.
  • Logical operator graphs for graph dataflow-style execution (similar to TensorFlow or PyTorch, but for data frames).
  • A multicore scheduler for parallel evaluation of operator graphs.

I will elaborate on the roadmap for building an analytics engine for Arrow memory, applicable to projects like pandas, in a future post.

Let’s delve deeper into the “10 Things” and how Arrow is addressing them, mitigating my “hate” for these pandas limitations.

1. Getting Closer to the Metal: No More Python Object Bloat

In Arrow, all column memory, whether for strings, numbers, or nested types, is arranged in contiguous memory buffers. This design optimizes performance for both random access (single values) and scans (sequential values). The goal is to minimize CPU or GPU cache misses when iterating through data in a table column, even with strings or non-numeric data.

In pandas, a string array is actually an array of PyObject pointers; the actual string data resides in PyBytes or PyUnicode structs scattered across the process heap. This object-heavy approach severely limits performance. In Python, a simple string like 'wes' consumes 52 bytes of memory, and even an empty string consumes 49 bytes. For a detailed analysis of these issues, see Jake Vanderplas’s insightful explanation on Why Python is Slow.

In contrast, Arrow stores strings contiguously in memory, so an entire string column can be scanned without cache misses; you are guaranteed to process contiguous bytes directly “against the metal.”
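
As a rough sketch of the difference (the exact byte counts reported by sys.getsizeof vary slightly across Python versions), compare the per-object overhead of Python strings with the small number of contiguous buffers backing an Arrow string array:

```python
import sys
import pyarrow as pa

# Every Python string is a standalone heap object with its own header.
print(sys.getsizeof("wes"))   # ~52 bytes
print(sys.getsizeof(""))      # ~49 bytes

# The same values in an Arrow string array live in contiguous buffers:
# an optional validity bitmap, int32 offsets, and the character data.
arr = pa.array(["wes", "mckinney", ""])
print(arr.buffers())   # [validity bitmap (None when no nulls), offsets, data]
print(arr.nbytes)      # only the offsets plus the raw characters
```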

Arrow’s C/C++ API also enables applications unaware of Python to produce and consume native Arrow tables, sharing them in-process or via shared memory/memory maps. Pandas’s lack of a C or Cython API for data frames has been a long-standing problem.

2. Memory Mapping Huge Datasets: Breaking the RAM Barrier

Perhaps the most significant memory management issue in pandas is the requirement to load entire datasets into RAM for processing. Pandas’s complex BlockManager is impractical for memory-mapping scenarios. This forces an unavoidable conversion and copy operation every time a pandas.DataFrame is created.

Arrow serialization includes a “data header” that details the exact locations and sizes of all memory buffers for every column in a table. This allows memory mapping of datasets larger than RAM and performing pandas-style algorithms in-place, eliminating the need to load them entirely into memory. You could read just 1 MB from a 1 TB table, incurring only the cost of those specific random reads. With modern solid-state drives, this is a highly efficient approach.
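
Here is a minimal sketch of that workflow using today’s pyarrow API (which is spelled a bit differently than the early releases were; the file name is illustrative):

```python
import pyarrow as pa

# Write a table in the Arrow IPC file format.
table = pa.table({"a": list(range(1000)),
                  "b": [str(i) for i in range(1000)]})
with pa.OSFile("example.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# memory_map opens the file without reading it into RAM; the "data header"
# tells Arrow where each column's buffers live inside the mapped region.
with pa.memory_map("example.arrow", "r") as source:
    mapped = pa.ipc.open_file(source).read_all()
    print(mapped.num_rows)   # the table's buffers point into the mapped file
```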

Arrow’s memory-mapping capability also allows multiple processes to operate on the same large dataset without any data movement or copying. This is effectively utilized in the Plasma Object Store (now part of Arrow) within the Ray project at UC Berkeley.

3. High-Speed Data Ingest and Export: Bypassing IO Bottlenecks

Arrow’s efficient memory layout and rich type metadata make it an excellent container for data ingested from databases and columnar storage formats like Apache Parquet.

A fundamental Arrow concept is the “record batch stream”, a sequence of atomic tables constituting a large dataset. This streaming data model is ideal for databases that serve record streams from database cursors.
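
A small sketch of a record batch stream round-trip with the current pyarrow API (the batch contents are illustrative):

```python
import pyarrow as pa

batch = pa.record_batch([pa.array([1, 2, 3]),
                         pa.array(["a", "b", "c"])],
                        names=["x", "y"])

# A stream is just a schema followed by any number of record batches.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    for _ in range(3):              # e.g. three chunks fetched from a cursor
        writer.write_batch(batch)

# The reader hands back one batch at a time; the full dataset never has to
# be materialized at once.
reader = pa.ipc.open_stream(sink.getvalue())
for chunk in reader:
    print(chunk.num_rows)
```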

We have built a high-speed connector to the Parquet format. The turbodbc project also provides high-performance ODBC-based database connections that leverage Arrow.
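
For example, a Parquet round-trip through Arrow memory looks like this with pyarrow.parquet (file and column names are illustrative):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"sym": ["A", "B", "A"], "px": [101.5, 99.0, 102.25]})
pq.write_table(table, "quotes.parquet")

# Reads land directly in Arrow memory; column projection avoids touching
# data you don't need, and conversion to pandas happens only at the edge.
px_only = pq.read_table("quotes.parquet", columns=["px"])
df = px_only.to_pandas()
```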

My goal is to build Arrow-native connectors for numerous other file formats and databases, including:

  • SQLite
  • PostgreSQL
  • Apache Avro
  • Apache ORC
  • CSV (a superior replacement for pandas.read_csv)
  • JSON

4. Doing Missing Data Right: Consistent and Efficient Null Handling

In Arrow, all missing data is represented using packed bit arrays, separate from the actual data. This ensures consistent and straightforward missing data handling across all data types. Furthermore, analytics on null bits (like AND-ing bitmaps or counting set bits) can be performed using fast bitwise hardware operators and SIMD.

Each array’s null count is explicitly stored in its metadata, so if the data contains no nulls, we can use faster code paths that skip null checks entirely. In pandas, because NumPy has no native null support, most analytics carry null-checking overhead even when no nulls are present, which hurts performance. With Arrow, if no nulls exist, the bit array doesn’t even need to be allocated.

Because NumPy lacks native missing data support, we’ve had to create null-aware versions of many critical performance algorithms in pandas. Ideally, null handling should be integrated into all algorithms and memory management from the ground up, as it is in Arrow.
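
A short pyarrow demonstration of the validity-bitmap design (the values are arbitrary):

```python
import pyarrow as pa

# Nulls live in a separate bitmap; the values buffer stays densely packed.
arr = pa.array([1.0, None, 3.0, None, 5.0])
print(arr.null_count)     # 2, available from metadata without a rescan
print(arr.buffers())      # [validity bitmap, float64 data]

# An array with no nulls skips the bitmap entirely, enabling the fast path.
dense = pa.array([1.0, 2.0, 3.0])
print(dense.null_count)   # 0
print(dense.buffers()[0]) # None -- no validity buffer was allocated
```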

5. Keeping Memory Allocations in Check: Transparent Memory Management

In pandas, memory is managed by either NumPy or the Python interpreter, making it difficult to precisely track memory usage by a pandas.DataFrame. A single line of code can unexpectedly double or triple a process’s memory footprint due to temporary allocations, often leading to MemoryError exceptions.

Arrow’s C++ implementation meticulously tracks all memory allocations within a central “memory pool.” This provides real-time insight into Arrow memory usage. Using “subpools” with parent-child relationships allows precise measurement of “high water marks” in algorithms, revealing peak memory consumption during analytical operations. This technique, common in databases, helps monitor and limit memory usage during operator evaluation. If memory limits are approached, mitigation strategies like spilling to disk (leveraging Arrow’s memory-mapping capabilities) can be employed.
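
For instance, with today’s pyarrow API the default memory pool can be inspected directly; a minimal sketch (the array is just a stand-in allocation):

```python
import pyarrow as pa

pool = pa.default_memory_pool()
before = pool.bytes_allocated()

# Every Arrow allocation is routed through the pool and shows up immediately.
arr = pa.array(range(1_000_000), type=pa.int64())
print(pool.bytes_allocated() - before)   # roughly 8 MB for the int64 buffer
print(pool.max_memory())                 # high-water mark seen by this pool
```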

Arrow memory is either immutable or copy-on-write. This ensures clarity on buffer references, preventing unnecessary defensive copying.

6. Supporting Categorical Data Well: First-Class Categorical Types

When I gave my 2013 talk, pandas lacked the pandas.Categorical type, which was implemented later. However, pandas’s workarounds for data types not natively supported by NumPy have always been somewhat cumbersome. pandas.Categorical is not readily interoperable outside of pandas. While extension dtypes exist, their implementation feels somewhat bolted-on due to pandas’s tight coupling with NumPy.

In Arrow, categorical data is a core, first-class citizen. We’ve prioritized efficient and consistent representation in-memory, on-wire, and in shared memory. Arrow supports sharing categories (referred to as dictionaries) across multiple arrays.
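
A brief sketch of Arrow’s dictionary encoding in pyarrow (the example values are arbitrary):

```python
import pyarrow as pa

# dictionary_encode() factors a column into a small dictionary of distinct
# values plus integer indices -- Arrow's categorical representation.
arr = pa.array(["low", "high", "low", "medium", "low"])
dict_arr = arr.dictionary_encode()
print(dict_arr.dictionary)   # ["low", "high", "medium"]
print(dict_arr.indices)      # [0, 1, 0, 2, 0]

# Converting back to pandas yields a Categorical-backed Series.
print(dict_arr.to_pandas())
```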

Pandas also has other user-defined types, such as datetime with time zone and periods. Arrow aims to gracefully support logical data types (with specific physical memory representations) so systems can seamlessly transport data using Arrow without requiring modifications to the Arrow format specifications.

7. Better Groupby(…).apply Operations: Parallelism and API Improvements

Arrow facilitates easier parallelization of groupby operations. Due to the aforementioned pandas limitations, fully parallelizing df.groupby(...).apply(f) operations is currently challenging or impossible.

Future development will also focus on enhancing the API for complex apply operations in pandas, leveraging Arrow’s capabilities.
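
To make the cost concrete, here is a naive workaround one might write today (it is not part of pandas or Arrow, and the function and column names are hypothetical): fanning groups out to a process pool, which forces every group to be pickled and copied into the workers. That serialization overhead is exactly what shared, zero-copy Arrow memory would eliminate.

```python
import numpy as np
import pandas as pd
from concurrent.futures import ProcessPoolExecutor

def demean_max(group: pd.DataFrame) -> float:
    # An arbitrary per-group function standing in for a user-supplied f.
    return (group["value"] - group["value"].mean()).abs().max()

def parallel_groupby_apply(df, key, func):
    # Each group is serialized and copied into a worker process.
    groups = [g for _, g in df.groupby(key)]
    with ProcessPoolExecutor() as pool:
        return list(pool.map(func, groups))

if __name__ == "__main__":
    df = pd.DataFrame({"key": np.random.randint(0, 100, 1_000_000),
                       "value": np.random.randn(1_000_000)})
    results = parallel_groupby_apply(df, "key", demean_max)
```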

8. Appending to DataFrames: Zero-Copy Appends

In pandas, all data within a DataFrame column must reside in a single NumPy array. This constraint often results in memory doubling and extra computational overhead when concatenating Series and DataFrame objects.

Arrow C++ table columns can be chunked. Appending to an Arrow table becomes a zero-copy operation, requiring minimal computation or memory allocation. By designing for streaming, chunked tables from the outset, appending to existing in-memory tables becomes significantly less expensive than in current pandas. This chunked data design is also crucial for implementing out-of-core algorithms, laying the groundwork for processing datasets exceeding available memory.
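
A small illustration with pyarrow’s concat_tables (table contents are illustrative):

```python
import pyarrow as pa

t1 = pa.table({"id": [1, 2], "val": [0.1, 0.2]})
t2 = pa.table({"id": [3, 4], "val": [0.3, 0.4]})

# The result's columns are chunked arrays that reference the original
# buffers; no data is copied to build the combined table.
combined = pa.concat_tables([t1, t2])
print(combined.column("val").num_chunks)   # 2 -- one chunk per input table
```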

9. Adding New Data Types: Flexible and Extensible Type System

Adding new data types in pandas involves multiple layers of complexity:

  • Defining new metadata.
  • Creating dynamic dispatch rules for operator implementations in analytics.
  • Preserving metadata throughout operations.

For instance, a “currency” type might carry its currency code as a string, with the underlying values stored as float64 or decimal. Currency values could then be treated computationally like ordinary numbers while the currency metadata is carried along through numeric operations.

Metadata preservation rules can vary depending on the operator, adding complexity.

Arrow decouples metadata representation from computational details and metadata management. The C++ implementation is designed to accommodate user-defined types. A key goal for the analytics engine development is to enable user-defined operator dispatch and metadata promotion rules.
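
As a hypothetical sketch using the extension-type API that today’s pyarrow exposes (it postdates this post, and the example.currency type and its details are invented for illustration), a “currency” type could wrap float64 storage while carrying a currency code as metadata:

```python
import pyarrow as pa

class CurrencyType(pa.ExtensionType):
    """float64 storage tagged with a currency code."""

    def __init__(self, currency_code: str = "USD"):
        self.currency_code = currency_code
        super().__init__(pa.float64(), "example.currency")

    def __arrow_ext_serialize__(self) -> bytes:
        return self.currency_code.encode()

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls(serialized.decode())

pa.register_extension_type(CurrencyType())

# The values remain plain float64 for computation; the metadata rides along.
storage = pa.array([19.99, 250.00], type=pa.float64())
prices = pa.ExtensionArray.from_storage(CurrencyType("USD"), storage)
print(prices.type.currency_code)   # "USD"
```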

10/11. Query Planning, Multicore Execution: Efficiency and Parallelism Unleashed

When you execute df[df.c < 0].d.sum() in pandas, it creates a temporary DataFrame df[df.c < 0] and then sums the d column of this temporary object. If df has many columns, this is highly inefficient. Even df.d[df.c < 0].sum() generates a temporary Series before summing!

Knowing the entire expression beforehand allows for significant optimization and elimination of these temporary allocations. Many algorithms, including this example, can also be parallelized across multiple processor cores.
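
To make the waste concrete, here is a hand-written sketch of the kind of rewrite a query planner could perform automatically (the data is synthetic):

```python
import numpy as np
import pandas as pd

n = 10_000_000
df = pd.DataFrame({"c": np.random.randn(n),
                   "d": np.random.randn(n),
                   "e": np.random.randn(n)})

# Eager evaluation: df[df.c < 0] materializes a temporary DataFrame with
# every column, most of which .d.sum() immediately discards.
total = df[df.c < 0].d.sum()

# What a planner could do instead: touch only the two columns involved and
# never build the intermediate frame.
mask = df["c"].to_numpy() < 0
total_fused = df["d"].to_numpy()[mask].sum()
```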

As part of building an analytics engine for Arrow, we plan to develop a lightweight physical “query planner” with a multicore in-process scheduler. This will enable efficient parallelization and evaluation of various algorithms. Drawing from existing work in graph dataflow execution (especially in machine learning, like TensorFlow and PyTorch), this involves creating a graph data flow engine where the fundamental data unit is an Arrow table.

To prepare for this, I initiated the Ibis project in 2015 (still under active development). Ibis aims to create a pandas-friendly deferred expression system for static analysis and compilation of these types of operations. Since an efficient multithreaded in-memory engine for pandas was not available then, I focused on building compilers for SQL engines (Impala, PostgreSQL, SQLite), similar to the R dplyr package. Phillip Cloud from the pandas core team has been a long-time active contributor to Ibis.
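
As a rough sketch of what a deferred Ibis expression looks like (the database file and table names are hypothetical, and Ibis’s API has evolved since 2015, so exact method names may differ between versions):

```python
import ibis

# Connect to a SQLite database containing a table "events" with columns c and d.
con = ibis.sqlite.connect("analytics.db")
events = con.table("events")

# The expression is built lazily; nothing executes until asked to.
expr = events[events.c < 0].d.sum()

print(ibis.to_sql(expr))   # inspect the SQL Ibis generates (recent versions)
result = expr.execute()    # run against the backend and return a scalar
```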

What’s Next? The Future is Arrow and pandas2

In my next blog post, I will delve deeper into the roadmap for developing an Arrow-native multithreaded in-memory execution engine and its relevance to the architecture of pandas2. This engine promises to address many of the “reasons why I hate” the current limitations and usher in a new era of performance and scalability for data analysis in Python.

Addendum: On Dask – A Helpful Band-Aid, Not a Cure

Many people ask about Dask (and Spark, and similar projects) and its role in improving pandas performance and scalability. Dask certainly provides valuable benefits, including:

  • Partitioning large datasets and processing them in separate threads or processes.
  • Managing pandas data in RAM more efficiently by evicting less-needed data.

Dask simplifies tasks like reading directories of CSV files in parallel using pandas.read_csv and performing groupby operations on the combined dataset. Matt Rocklin and his team have created an exceptional tool.
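
For example, a typical Dask workflow looks like this (the file pattern and column names are illustrative):

```python
import dask.dataframe as dd

# Each matching CSV becomes one or more partitions; every partition is an
# ordinary pandas DataFrame processed in its own thread or process.
df = dd.read_csv("logs/2017-*.csv")

# Operations build a task graph; .compute() triggers parallel execution,
# with pandas doing the per-partition work under the hood.
daily_totals = df.groupby("date")["bytes"].sum().compute()
```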

However, Dask treats pandas as a black box. dask.dataframe doesn’t fundamentally solve pandas’s inherent performance and memory issues. It distributes these problems across multiple processes and mitigates them by carefully managing data piece sizes to avoid MemoryError exceptions.

Some problems are not well-suited to Dask’s partition-parallel distributed task execution model. Furthermore, pandas’s memory management and IO challenges make Dask jobs slower than they could be with a more efficient in-memory runtime like what Arrow aims to provide. Dask is a helpful tool within the existing pandas ecosystem, but the true long-term solutions lie in architectural improvements like those offered by Apache Arrow and pandas2.
