StateTrace
Visual Quant & Low-Latency Systems Lab

Polars Lazy Frame

data engineering · L2 · idiom
Replaces the belief that DataFrame operations should execute immediately.

A lazy Polars pipeline records `select`, `with_columns`, `group_by_dynamic`, and `join` as a logical query plan; the optimiser then rewrites the plan before any data is touched. Predicate pushdown moves filters to the I/O boundary, projection pruning drops columns the query does not read, and operator fusion collapses adjacent maps into one pass. Eager evaluation gives up all three.

What lazy means

An eager DataFrame call (df.filter(...), df.select(...)) materialises the result in memory immediately. A LazyFrame call records the operation into a logical plan and returns without executing. Nothing materialises until you call .collect().

The difference is not just throughput. Eager execution forces every intermediate result into RAM in the order the user wrote the code. Lazy execution gives the optimiser the entire pipeline at once — it can reorder, drop, and fuse operations because it sees what the final result needs, not just the next step.

Polars exposes both modes. pl.DataFrame.filter(...) is eager; pl.LazyFrame.filter(...) is lazy. pl.scan_parquet() returns a LazyFrame; pl.read_parquet() returns an eager DataFrame.

Eager vs lazy — the same query, two evaluation strategies

python
import polars as pl

# === Eager: every operation executes immediately ===
df = pl.read_parquet("trades.parquet")            # reads ALL columns, ALL rows
df = df.filter(pl.col("symbol") == "AAPL")        # reads, then filters
df = df.select(["timestamp", "price", "shares"])  # already loaded, now drops cols
result = df.group_by("timestamp").agg(
    pl.col("price").mean(),
)                                                 # already a DataFrame; eager frames have no .collect()

# === Lazy: operations are recorded as a plan ===
plan = (
    pl.scan_parquet("trades.parquet")             # plans the read, does NOT execute
      .filter(pl.col("symbol") == "AAPL")
      .select(["timestamp", "price", "shares"])
      .group_by("timestamp")
      .agg(pl.col("price").mean())
)

print(plan.explain(optimized=False))              # logical plan as written, before optimisation
print(plan.explain(optimized=True))               # after optimiser rewrites (the default)
result = plan.collect()                           # executes the optimised plan

# After optimisation:
#   - filter is pushed into the parquet scan (row groups that cannot contain AAPL are skipped)
#   - projection is pushed into the parquet scan (only the columns the query needs are
#     read; symbol is read for the predicate but does not appear in the output)
#   - when run on the streaming engine, the pipeline can process the data in batches
#     instead of holding every intermediate DataFrame in RAM

Three rewrites the optimiser performs

Projection pruning. A select(['a', 'b']) on a 15-column table tells the optimiser: 13 columns are never read downstream. The optimiser pushes the column list back to the parquet scan; only a and b are decompressed and materialised. Saves I/O bandwidth, memory, and SIMD work. Single biggest win on wide tables.

Predicate pushdown. A filter(pl.col('symbol') == 'AAPL') after a parquet scan is rewritten to a parquet scan with the filter applied at read time. Parquet's column statistics (min, max, null count per row group) let the engine skip entire row groups whose symbol range cannot contain 'AAPL'. Saves disk reads.

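Operator fusion. Adjacent map operations, for example df.with_columns(x=pl.col('a') + 1).with_columns(y=pl.col('b') * 2), collapse into one pass over the data. The fused step runs in one loop instead of two, which saves the intermediate materialisation between the two with_columns calls. In Polars this clustering applies to sequential with_columns calls that are independent of each other; a call that reads a column produced by the previous one still has to run after it.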

All three rewrites are correctness-preserving: the optimised plan returns the same result as the original. The 1979 Selinger paper on System R established plan-level query optimisation as a framework; modern engines (DuckDB, ClickHouse, Polars, Snowflake) share it.

Reading a Polars query plan

python
# Plan output (abbreviated, polars 1.x):
#
# Before optimisation:
#   AGGREGATE
#     [col("price").mean()] BY [col("timestamp")] FROM
#     SELECT [col("timestamp"), col("price"), col("shares")] FROM
#       FILTER [(col("symbol")) == ("AAPL")] FROM
#         PARQUET SCAN trades.parquet
#           PROJECT */15 COLUMNS
#
# After optimisation:
#   AGGREGATE
#     [col("price").mean()] BY [col("timestamp")] FROM
#     PARQUET SCAN trades.parquet
#       PROJECT 3/15 COLUMNS
#       SELECTION: [(col("symbol")) == ("AAPL")]
#
# Rewrites visible:
#   1. PROJECT */15 → PROJECT 3/15  (projection pruning: 12 columns never read)
#   2. FILTER moved into SCAN as SELECTION  (predicate pushdown: filter at I/O)
#   3. SELECT node removed: its column list was absorbed into the scan's projection
#      (a consequence of projection pushdown rather than a separate fusion step)

What lazy buys you on real workloads

On a 1 TB parquet dataset with 15 columns and 1 month of intraday trades, the eager version of the example above reads ~1 TB, then filters to AAPL rows (~5%) and drops 12 columns. Wall-clock: roughly the time to scan 1 TB from disk + decompress.

The lazy version reads only the 3 needed column chunks per row group, skips row groups whose symbol statistics exclude AAPL, and never materialises the 12 unused columns. Wall-clock: roughly the time to scan ~10–20 GB of selectively decompressed data.

The speedup ratio depends on the selectivity of the filter and the column ratio in the projection. For a typical analytics query — WHERE timestamp BETWEEN ... AND symbol = '...' on a wide table — the lazy win is 10–100×, and the dominant term is parquet I/O skipped, not CPU saved.
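
The figures above follow from simple arithmetic. The sketch below is a back-of-envelope estimate under two loud assumptions that the source does not state: every column has the same on-disk width, and the file is clustered by symbol so row-group skipping is perfect. `estimated_bytes_read` is a hypothetical helper, not a Polars function.

```python
def estimated_bytes_read(total_bytes, cols_read, cols_total, row_fraction):
    """Rough lower bound: equal-width columns and perfect row-group skipping assumed."""
    return total_bytes * (cols_read / cols_total) * row_fraction


TB = 10**12
GB = 10**9

eager_bytes = estimated_bytes_read(1 * TB, 15, 15, 1.0)   # every column, every row
lazy_bytes = estimated_bytes_read(1 * TB, 3, 15, 0.05)    # 3 of 15 columns, ~5% of rows

print(f"eager: {eager_bytes / GB:.0f} GB")
print(f"lazy:  {lazy_bytes / GB:.0f} GB")
print(f"ratio: {eager_bytes / lazy_bytes:.0f}x")
```

Under these assumptions the lazy read comes to about 10 GB, the low end of the 10–20 GB range; real files skip row groups less cleanly, which pushes the number up.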

When lazy does not help (or hurts)

Lazy is the wrong mode when:

Interactive exploration. A REPL or notebook where each cell's result is inspected before the next is written. .collect() after every step is awkward; eager mode keeps the feedback loop tight.

Small data, simple queries. Optimiser overhead is ~milliseconds. On a 10K-row DataFrame doing one filter + one aggregate, the plan rewrite is more expensive than the execution itself.

Operations that defeat the optimiser. A Python UDF passed to .map_elements() is a black box; the optimiser cannot see what the function reads or computes, so the rewrites that depend on it stop there. The lazy plan still runs but loses most of its leverage. The fix is to express the UDF in Polars expressions (composable expressions stay in the planner) or to wrap it in LazyFrame.map_batches(), whose predicate_pushdown and projection_pushdown flags declare which rewrites may safely move past the function.

Streaming queries with side effects. A pipeline that writes intermediate results to disk for inspection cannot be rewritten — the side effect imposes ordering the optimiser must respect.

Measure a lazy rewrite

Take any parquet dataset on your machine (or download a sample from https://pola.rs/data/). Write the same query two ways:

(a) Eager: pl.read_parquet(...).filter(...).select(...).group_by(...).agg(...).

(b) Lazy: pl.scan_parquet(...) → same chain → .collect().

For the lazy version, print plan.explain(optimized=False) and plan.explain(optimized=True) and identify (1) which columns survive projection pruning, (2) which filter moved into the scan, (3) any operator fusions.

Then timeit both versions on the same dataset. Report the speedup. If the dataset is small enough that the difference is < 2×, scale up (more rows or wider table) until it is.

Expected: Eager reads every column for every row, then filters and projects in memory. Lazy reads only the columns named in `.select()` and only the rows that pass the filter (subject to parquet row-group statistics). On a wide parquet file (15+ columns) with a selective filter (<10% of rows match), the lazy speedup is typically 10–100×. The `.explain(optimized=True)` output names the rewrites — `PROJECT N/M COLUMNS` shrinks the column count, `SELECTION:` appears inside `SCAN` rather than as a separate `FILTER` node, and any chained `with_columns` collapse into one.

Bridges
  • time-bar-construction · implementation → model
    `group_by_dynamic` turns a stream of timestamped events into bars and rolling windows, the same operation a Stage 4 backtester or a Stage 5 microstructure study needs on tick data. The lazy plan moves the predicate ahead of the window construction, often eliminating most rows before the aggregation runs.
    Where this shows up
    • `group_by_dynamic` collapses tick stream into 1-minute OHLCV bars
    • Rolling window over event-time timestamps, not row index
    • Predicate pushdown moves the symbol filter ahead of the window
  • query-planner · shared mechanism
    Lazy DataFrame planning, SQL query planning, and ClickHouse / DuckDB / Snowflake optimisation all run the same family of rule-based and cost-based transforms over a logical plan. The 1979 Selinger paper established the framework; the techniques have been mostly stable since.
Done state

Evidence the learner produces, checks that confirm it.

Evidence
  • artifact: Two `timeit` measurements (eager vs lazy) on the same query against the same parquet dataset on the learner's machine. The speedup ratio recorded, plus the `.explain(optimized=True)` output identifying which specific rewrites fired.
  • observable behavior: Reads a Polars `.explain()` plan and identifies each rewrite: predicate pushdown (filter moves into scan), projection pruning (column count drops), operator fusion (adjacent maps collapse). Predicts whether a given query benefits from lazy mode by inspecting the column ratio and filter selectivity.
Checks
  • manual: Names the three optimiser rewrites (predicate pushdown / projection pruning / operator fusion) and gives one concrete example of each. Reference: pushdown moves `WHERE` into `SCAN`; pruning drops unused columns; fusion collapses adjacent `with_columns` chains.
  • manual: Given a query with a Python UDF via `.map_elements()`, explains why projection pruning past the UDF fails and proposes the rewrite to keep the optimiser engaged. Reference: the UDF is opaque to the planner; rewrite as Polars expressions or use `.map_batches()` to preserve the optimiser boundary.
  • manual: Names two workloads where eager mode is the right choice and two where lazy mode is the right choice. Reference: eager: REPL exploration, small data, simple queries; lazy: multi-step pipelines on wide datasets, parquet/CSV ingestion with filters, streaming joins.
References