Pandas vs Polars Internals

data engineering · L2 · idiom

Replaces the belief that Polars is faster because Rust is faster.

The measured 10–30× gap decomposes into four independent factors that multiply: (1) Arrow columnar buffers, (2) lazy query planning, (3) SIMD kernels, (4) parallel execution across cores via GIL release. Pandas gives up all four; Polars cashes in all four. Knowing the decomposition lets you predict the speedup before measuring.

The four factors

When Polars runs a typical analytics query 10–30× faster than pandas, the speedup is not one big effect — it is the product of four independent factors, each contributing 1.5–4×. Knowing the factors lets you predict the speedup before measuring, and explains why some pandas operations are already fast (one or two factors apply) while others are catastrophically slow (none apply).

1. Arrow columnar buffers: typed contiguous columns + cache-line economy + projection pruning at the buffer level.
2. Lazy query planning: predicate pushdown, projection pruning at the plan level, operator fusion.
3. SIMD kernels: the per-column loop body runs in vectorised C/Rust, not Python interpreter bytecode.
4. Parallel execution across cores: Polars releases the GIL and spawns N worker threads; pandas runs single-threaded under the GIL.

All four are necessary for the full speedup. Pandas 2.x with PyArrow backend enables factor 1 partially and nothing else; modin enables factor 4 by sharding the DataFrame; Dask enables factors 1, 2, 4 at distributed scale but not single-machine.

Factor 1 — Arrow columnar buffers

Pandas stores numeric columns as NumPy arrays, which are close to Arrow's layout; object-dtype columns, however, are arrays of pointers to Python objects. Polars stores everything as Arrow buffers (typed contiguous values + validity bitmap; see Arrow Columnar Layout). The difference matters most when:

- The query reads N columns and the table has M: Polars's projection pruning means only N column buffers are touched. A pandas pipeline often pulls the whole DataFrame through memory unless df[cols] happens early.
- The data has any nulls: with its default NumPy dtypes, pandas uses NaN as the missing-value sentinel for floats and upcasts integer columns that contain missing values to float64. Arrow's validity bitmap is one bit per element, with no type widening.

Quantitative contribution to the speedup: 2–10× on wide tables; 1× on narrow ones. The win is in what does not get touched.

Factor 2 — Lazy query planning

Polars's LazyFrame (see Polars Lazy Frame) collects operations into a plan and rewrites it before execution. The optimiser does predicate pushdown (move filters to scan time), projection pruning (drop unused columns at scan time), and operator fusion (collapse adjacent maps). Pandas evaluates eagerly — every operation materialises immediately.

Quantitative contribution: 1× to 100× depending on the query. Negligible on a single-filter pipeline; huge on read_parquet → filter → select → group_by where the optimiser turns the chain into a single selective parquet scan.

The key signal: the speedup is dataset-dependent. The optimiser's win is the I/O it avoids, which depends on filter selectivity and parquet row-group statistics.

Factor 3 — SIMD kernels

Both libraries run the inner loop in compiled code: pandas through NumPy ufuncs written in C, Polars through its Rust kernels. The difference is that Polars's kernels are SIMD-compiled at build time with runtime feature dispatch (AVX2/AVX-512 on x86, NEON on ARM), while NumPy's SIMD coverage depends on the kernel and the operation.

For simple arithmetic over typed contiguous data, both libraries reach SIMD; the gap is small (~1–2×). For string operations, struct access, or unusual reductions, Polars's kernels are typically wider-vectorised; the gap grows.

Quantitative contribution: 1.5–3× on numeric workloads, larger on string-heavy ones.
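The kernel advantage disappears as soon as the loop body becomes a Python callback. A rough sketch in pandas alone, using an illustrative Series and a hypothetical `per_element` function: the same arithmetic is computed once by a compiled column kernel and once per row through `apply`. Timings are printed, not asserted, because they vary by machine.

```python
import time

import pandas as pd

s = pd.Series(range(1_000_000), dtype="int64")

def per_element(v):
    # Runs as Python bytecode once per row; no compiled kernel is involved.
    return v * 2 + 1

start = time.perf_counter()
fast = s * 2 + 1                      # one compiled loop over the whole column
t_kernel = time.perf_counter() - start

start = time.perf_counter()
slow = s.apply(per_element)           # one Python call per element
t_apply = time.perf_counter() - start

same = bool((fast == slow).all())
print(f"kernel: {t_kernel * 1e3:.1f} ms, apply: {t_apply * 1e3:.1f} ms, same result: {same}")
```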

Factor 4 — Parallel execution across cores

Polars runs queries on a thread pool sized to the machine's logical core count. Aggregations split by group, filters parallelise across row chunks, joins partition the input and hash-merge in parallel. Each thread holds an Arrow buffer slice and runs without coordination during the kernel.

Pandas runs single-threaded. Many of its inner loops are in C and release the GIL during the call, but pandas itself does not orchestrate threads, so every operation runs on the calling thread.

Quantitative contribution: roughly 2–8×, bounded above by the core count and approaching it only for embarrassingly parallel queries. The cap is real: coordination overhead grows with thread count, and small queries don't have enough work to fill all cores.

On multi-core hardware this factor is often the single largest contributor to the speedup. It is also the factor that pandas 2.x with PyArrow backend does not enable: PyArrow backend changes the storage layout but not the execution model.

Decomposing a single measurement

python

import polars as pl
import pandas as pd
import numpy as np
import time

# 10M rows, 10 float64 columns of uniform random values (no nulls are injected).
N = 10_000_000
data = {f"c{i}": np.random.random(N) for i in range(10)}

df_pd = pd.DataFrame(data)
df_pl = pl.DataFrame(data)

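# Reference query: filter on c0, project c1+c2, group by integer key, aggregate.
# In the pandas version the key Series is built from the unfiltered frame;
# groupby aligns it to the filtered rows by index.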
def query_pd():
    return (df_pd[df_pd.c0 > 0.5]
              .assign(s=lambda x: x.c1 + x.c2)
              .groupby((df_pd.c0 * 100).astype(int))["s"].mean())

def query_pl_eager():
    return (df_pl.filter(pl.col("c0") > 0.5)
                 .with_columns(s=pl.col("c1") + pl.col("c2"))
                 .group_by((pl.col("c0") * 100).cast(pl.Int64))
                 .agg(pl.col("s").mean()))

def query_pl_lazy():
    return (df_pl.lazy()
                 .filter(pl.col("c0") > 0.5)
                 .with_columns(s=pl.col("c1") + pl.col("c2"))
                 .group_by((pl.col("c0") * 100).cast(pl.Int64))
                 .agg(pl.col("s").mean())
                 .collect())

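def bench(fn, repeats=3):
    """Return the best-of-`repeats` wall-clock time of fn() in milliseconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, (time.perf_counter() - start) * 1e3)
    return best

for fn in (query_pd, query_pl_eager, query_pl_lazy):
    print(f"{fn.__name__:<16}{bench(fn):8.0f} ms")

# Apple M1 Pro, 10 cores, polars 1.x, pandas 2.2:
#   query_pd        ~2400 ms     pandas baseline
#   query_pl_eager  ~140 ms      Polars eager: ~17× over pandas
#   query_pl_lazy   ~95 ms       Polars lazy:  ~25× over pandas (+ 1.5× over eager)
#
# The pandas → Polars-eager gap (~17×) is the product of three factors:
#   - Arrow columnar buffers           ~2×    (cache + projection)
#   - SIMD vectorisation kernels        ~2×    (compiled loop body; NEON on this machine)
#   - Parallel execution across cores   ~4×    (Polars uses N threads; pandas one)
#
# The Polars-eager → Polars-lazy gap (~1.5×) is the fourth factor:
#   - Lazy query planning                ~1.5×  (filter pushdown, fusion)
#
# Product: 2 × 2 × 4 × 1.5 = 24×, close to the observed ~25× total.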
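The arithmetic behind the decomposition is short enough to write out. The sketch below reuses the three timings quoted in the comments above; the per-factor estimates of ~2× are the author's, and the variable names are illustrative. Because the factors multiply, fixing any two of them implies the third.

```python
# Timings quoted above, in milliseconds.
t_pd, t_eager, t_lazy = 2400.0, 140.0, 95.0

eager_gap = t_pd / t_eager    # pandas -> Polars eager
lazy_gap = t_eager / t_lazy   # Polars eager -> Polars lazy
total_gap = t_pd / t_lazy     # pandas -> Polars lazy

print(f"eager gap: {eager_gap:.1f}x")   # 17.1x
print(f"lazy gap:  {lazy_gap:.2f}x")    # 1.47x
print(f"total gap: {total_gap:.1f}x")   # 25.3x

# With Arrow buffers and SIMD each estimated at ~2x, the eager gap implies
# the remaining factor (parallelism).
arrow_est, simd_est = 2.0, 2.0
implied_parallel = eager_gap / (arrow_est * simd_est)
print(f"implied parallelism factor: {implied_parallel:.1f}x")   # 4.3x

# Prediction from the rounded per-factor estimates.
predicted = 2 * 2 * 4 * 1.5
print(f"predicted total: {predicted:.0f}x")   # 24x
```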

Pandas 2.x with PyArrow backend — where the gap shrinks

Pandas 2.0 (April 2023) added optional Arrow-backed storage, selected with dtype_backend='pyarrow' on the readers (for example pd.read_parquet(..., dtype_backend='pyarrow')) or on an existing frame with df.convert_dtypes(dtype_backend='pyarrow'). This enables Factor 1 (Arrow buffers + validity bitmap) within pandas. Operations that route through pyarrow's compute kernels also pick up some of Factor 3 (SIMD). The pandas-vs-Polars gap shrinks from 10–30× to 3–10× on numeric workloads.

What PyArrow backend does not enable: lazy query planning (Factor 2) and multi-core parallel execution (Factor 4). The pandas API model is eager-and-single-threaded; bolting Arrow on changes the storage layout but not the execution model.

For a learner: pandas 2.x with PyArrow is the right default for new pandas code; the storage layout costs nothing and unlocks some of the gap. But the order-of-magnitude speedup only comes from switching the execution model, which Polars (and DuckDB, and ClickHouse) have but pandas does not.

Decompose your own speedup

Pick a real analytics query you have written in pandas. Reimplement it in Polars (eager and lazy versions). Measure all three with timeit on the same data on your hardware.

For each pairwise gap — pandas → Polars-eager, Polars-eager → Polars-lazy — identify which of the four factors contributed and roughly how much. Use the synthesising benchmark above as a model.

Bonus: rerun pandas with dtype_backend='pyarrow' (pandas 2.x+) and measure again. Identify which factor(s) the backend switch enables.

Expected: On a 10-core M1 Pro, the pandas → Polars-eager gap on a typical analytics query (filter + groupby + agg over 10M+ rows) is 10–25×. The decomposition is a product: ~2× Arrow buffers × ~2× SIMD × ~4× parallelism. The Polars-eager → Polars-lazy gap is 1.2–3× depending on selectivity. Pandas 2.x + PyArrow backend closes ~40–60% of the gap; the residual is Factor 4 (multi-core), which the pandas API cannot expose.

Prerequisites

Unlocks—

Bridges

pyo3-ffi · model → implementation
Stage 7 PyO3 is the FFI machinery that lets Polars's Rust core be called from Python with zero data copy. The same mechanism is the escape hatch when a custom kernel is faster than any DataFrame expression: write the kernel in Rust, expose it via PyO3, and release the GIL around the call.
Where this shows up
- Polars Rust core called from Python with zero-copy Arrow buffers
- `#[pyfunction]` macro generates the Python-callable wrapper
- GIL released around the Rust kernel so other Python threads progress
duckdb-polars-pandas-decision · shared measurement
The same four-factor decomposition explains DuckDB-vs-pandas (same factors, similar magnitudes) and ClickHouse-vs-Postgres (same factors at a database scale). The decomposition is the framework for predicting analytics performance across the modern columnar ecosystem.
Where this shows up
- DuckDB beats Polars on multi-table joins with selective filters
- Polars beats pandas on the same query by all four factors at once
- ClickHouse beats both at >10 GB by parallel scan across columnar files

Done state

Evidence the learner produces, checks that confirm it.

Evidence

artifact · Three `timeit` measurements (pandas eager / Polars eager / Polars lazy) on the same query against the same dataset, on the learner's hardware, with the decomposition recorded: which factor contributed how much to each pairwise gap.
observable behavior · Given an arbitrary analytics query and a target hardware spec, predicts the pandas-to-Polars speedup ratio to within 2× by reasoning about the four factors. Bonus: predicts where pandas 2.x + PyArrow lands the same query.

Checks

manual · Names the four factors and the canonical Stage 2 card for each (Arrow buffers → arrow-columnar-layout; lazy planning → polars-lazy-frame; SIMD → vectorization; parallel execution → gil). Estimates the quantitative contribution range for each factor.
manual · Identifies which factor contributes most for a specific query shape: (a) `SELECT SUM(col) FROM wide_table`, (b) `SELECT * FROM table WHERE complex_filter`, (c) `df.col.apply(python_function)`. Reference: (a) Arrow buffers (one column out of many); (b) lazy planning (predicate pushdown); (c) none: the apply boundary runs a Python call per element, which kills Factors 2, 3 and 4 entirely.
manual · Explains why pandas 2.x with PyArrow backend reduces but does not eliminate the gap. Reference: PyArrow backend enables Factor 1 (storage) and partially Factor 3 (SIMD via pyarrow compute); it does not change the execution model, so Factor 2 (lazy) and Factor 4 (multi-core) remain unavailable.

References

Polars User Guide: Lazy API · spec · polars-user-guide-lazy
Polars: The streaming engine architecture (Ritchie Vink, 2024) · article · polars-streaming-engine-blog
Apache Arrow: Columnar Format Specification · spec · apache-arrow-format-spec
Python for Data Analysis, 3rd ed., ch. 5: Getting started with pandas (Wes McKinney) · book · mckinney-python-data-analysis-ch5