Python Pandas is about to get a performance boost: When the long-awaited version 3.0 of the data analysis library is released, it will come with a faster engine for loading and reading columnar data. PyArrow will take the place of NumPy, the math library Pandas has used thus far.
Pandas already supports PyArrow, and has done so since version 2.0, released in April 2023. In the next major version, 3.0, PyArrow will become a required dependency, with pyarrow.string as the default type inferred for string data.
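Pandas 2.x already lets you opt in to the PyArrow-backed types ahead of the switch. A minimal sketch, assuming a hypothetical example.csv:

```python
import pandas as pd

# Opt in to PyArrow-backed dtypes in pandas 2.x (the planned default for
# strings in 3.0). "example.csv" is a placeholder file name.
df = pd.read_csv("example.csv", dtype_backend="pyarrow")

print(df.dtypes)  # e.g. string[pyarrow], int64[pyarrow], ...
```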
“So the good news, PyArrow is 10 times faster. What else do you need to know? Like it’s just really ridiculously faster,” said Python instructor Reuven Lerner during a session on the “PyArrow Revolution,” held earlier this month at PyCon US 2025 in Pittsburgh.
The Way of the Pandas
Created in 2008 by financial quant Wes McKinney, the Pandas library is now used by many to manage large data sets. He originally built it on top of the NumPy scientific computing library, which, among other features, can store large arrays of data in a wide variety of numeric types.
A Pandas Series is basically a wrapper around a one-dimensional NumPy array; a Pandas DataFrame is a wrapper around a two-dimensional one. Because NumPy is written in C and vectorized, it handles these arrays far faster and more efficiently than pure Python could.
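That wrapping is easy to see from code. A short illustration with arbitrary values:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])                 # a labeled 1-D array
print(type(s.to_numpy()))                      # <class 'numpy.ndarray'>

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
print(type(df.to_numpy()))                     # also a NumPy array, just 2-D
```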
But NumPy, built in 2005 as a successor to the Numeric library, predates many of the data concerns of the past decade, such as data streaming, nested rows and complex data types. It has trouble with dates, offers no compression and is not even that great for batch processing.
Worst of all, it is slow with columnar data, which, when you think about it, is basically what arrays are. NumPy still stores everything in rows, which makes column-wise processing painfully slow, as it tracks down each value one at a time. And it is single threaded, so it does all its calculations serially, limited by the speed of one processor core.
Introducing PyArrow
PyArrow offers columnar storage, which eliminates all the computational back and forth that comes with NumPy. PyArrow also paves the way for running Pandas, by default, in Copy on Write mode, which improves memory usage and performance.
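Copy on Write can already be switched on in pandas 2.x; a minimal sketch of what the mode changes:

```python
import pandas as pd

# Opt in under pandas 2.x; planned as the default behavior in 3.0
pd.options.mode.copy_on_write = True

df = pd.DataFrame({"a": [1, 2, 3]})
view = df["a"]          # no data is copied yet
view.iloc[0] = 99       # the write triggers a lazy copy ...
print(df["a"].iloc[0])  # ... so the original DataFrame still holds 1
```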
PyArrow is the set of Python bindings for Apache Arrow. Also created by McKinney, Apache Arrow is a cross-language in-memory format that stores data in columns, making it easy to persist to disk and fast to compute on.
The columnar orientation provides faster data reads and writes for most open source data processing engines, including Spark, Flink, Dremio, Drill and Ray. A lot of AI modeling is built on columnar data, so the format is much favored by AI frameworks such as TensorFlow and PyTorch.
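The column-first layout is visible directly from PyArrow itself. A small sketch, with invented column names loosely echoing the parking-violations example below:

```python
import pyarrow as pa

# Each column lives in its own contiguous, typed buffer
table = pa.table({
    "plate": ["ABC123", "XYZ789"],  # hypothetical values
    "fine":  [115, 65],
})

print(table.schema)          # plate: string, fine: int64
print(table.column("fine"))  # one column, fetched without scanning rows
```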
Lerner offered an example of how inefficient NumPy can be, using a 2.2GB CSV (comma-separated values) file of New York parking violations for the year 2020, which runs to about 12 million rows.
Reading that CSV file into memory took 55.8 seconds with the default NumPy-backed engine; PyArrow did the same work in 11.8 seconds.
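Lerner's exact benchmark code wasn't shown, but a comparison along those lines can be run with pandas' built-in engine switch; violations.csv stands in for the real file:

```python
import pandas as pd

# Default parser (single-threaded C engine backed by NumPy types)
df_numpy = pd.read_csv("violations.csv")

# PyArrow's multithreaded CSV reader
df_arrow = pd.read_csv("violations.csv", engine="pyarrow")
```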
Arrow also works with two binary formats that speed data exchange even more. One is Feather, an uncompressed data format, and the other is Parquet, which compresses data.
That 2.2GB CSV file took up only 1.4GB when rendered into the Feather format and 379MB in Parquet. Because the data is binary, Pandas doesn’t have to pause to figure out the data type, Lerner noted.
Performance increased as well: With Feather, that entire CSV file could be read in 10.6 seconds, and with Parquet it took only 9.1 seconds, according to Lerner’s tests.
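Converting a DataFrame to either binary format, and reloading it later, is a one-liner each way in pandas (file names here are hypothetical):

```python
import pandas as pd

df = pd.read_csv("violations.csv", engine="pyarrow")  # one-time CSV parse

# Write Arrow's binary formats
df.to_feather("violations.feather")  # uncompressed
df.to_parquet("violations.parquet")  # compressed

# Later loads skip CSV parsing and type inference entirely
df = pd.read_feather("violations.feather")
df = pd.read_parquet("violations.parquet")
```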
The Release of Pandas 3.0
When Pandas 3.0 will arrive, however, is still an open question. It was originally due in April 2024, a date that came and went with no release, and as of press time, none is scheduled. The latest release, v2.2.3, was issued in September.
In many ways, Pandas is by now a legacy technology, so the embedding of PyArrow is good news for organizations that want to speed up data-crunching operations without the messy work of migrating to a new platform.
“The real advantage here is that you get to keep your use of Pandas, keep the same API,” Lerner said. “You swap out the backend in favor of a new one, and voila, you save tons of time and tons of memory.”
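In code, that backend swap can be as small as a single call. A sketch using made-up data:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob"], "score": [9.5, 7.0]})

# Same DataFrame, same API -- only the backing storage changes
df_arrow = df.convert_dtypes(dtype_backend="pyarrow")

print(df_arrow.dtypes)  # string[pyarrow], double[pyarrow]
```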