Curriculum/arrow-columnar-layout

Arrow Columnar Layout

data engineering·L0 · atom

Replacesthe belief that a DataFrame is a 2D array stored row-by-row.

Apache Arrow stores each column as a contiguous typed buffer plus a validity bitmap. SIMD scans fit, cache lines fit, and the buffers travel zero-copy between Polars, DuckDB, ClickHouse, Spark, and Arrow Flight. The atom underneath every modern analytics tool.

What columnar means

A traditional row-oriented DataFrame stores tuples in memory: row 1's (col1, col2, col3), then row 2's (col1, col2, col3), etc. Reading column 1 alone means touching every row's memory; the CPU pulls in col2 and col3 data it never uses.

Columnar storage inverts the layout. Each column is its own contiguous buffer — all of col1's values laid out in memory, then all of col2's, then all of col3's. Reading column 1 touches exactly column 1's memory.

The trade: row inserts touch N buffers (one per column); column scans touch one. Analytics workloads are dominated by column scans, so columnar wins by ~10× on memory bandwidth for a 10-column table (one column scan touches 1/10 of a row-store's cache lines). OLTP workloads are dominated by single-row inserts, so row stores still win there.

Arrow's three primitives

Apache Arrow is a specific columnar memory layout standard. Each column is three things:

1. A typed buffer. int64, float32, bool — laid out contiguously. The buffer has no per-element type tag; the type is metadata on the column, not on each value. A 10M-element int64 column is exactly 80 MB; nothing more.

2. A validity bitmap. One bit per element, 1 if the value is present, 0 if NULL. The bitmap is its own small buffer alongside the values buffer. SIMD masking of nulls is one AND instruction per cache line of bits. No sentinel value (like NaN or -2^63) needed in the data.

3. For variable-length types, an offsets buffer. Strings and lists need this. A column of strings is not an array of string objects — it is a flat byte buffer plus an int32 offsets buffer naming where each string starts. Slicing a string column is pointer arithmetic on the offsets, no string copies.

Arrow buffer layout in pyarrow

python

import pyarrow as pa
import numpy as np

# A column of 10M int64 values, with ~1% nulls.
values = np.arange(10_000_000, dtype=np.int64)
mask   = np.random.random(10_000_000) > 0.01      # True = present
arr    = pa.array(values, mask=~mask, type=pa.int64())

print(arr.type)                  # int64
print(arr.null_count)            # ~100_000
print(arr.buffers())             # [validity bitmap, values buffer]
print(arr.buffers()[1].size)     # 80_000_000 bytes — 10M × 8, zero per-element overhead
print(arr.buffers()[0].size)     # 1_250_000 bytes — one bit per element (10M / 8)

# That is the entire memory footprint of a typed nullable column.
# No object headers, no boxed integers, no per-element type tags.

Why this layout is right for analytics

Three properties stack multiplicatively.

Cache-line economy. A 64-byte cache line holds 8 int64s. A column scan moves through cache lines linearly — every byte loaded gets used. A row scan over a 10-column table wastes 9/10 of every cache line on unrelated columns. Memory bandwidth used is ~10× lower for columnar.

SIMD compatibility. A typed contiguous buffer is exactly the shape numpy and SIMD instructions expect. arr.sum() runs as one vectorised C loop with AVX/NEON; the same op on row-stored data needs a stride-aware loop that defeats most CPU SIMD.

Predicate pushdown and projection pruning. When filtering rows, the engine reads the predicate column first, ANDs the validity bitmap with the predicate mask, then loads the columns it needs only for surviving rows. Row stores cannot do this; they always read all columns. Polars's lazy frame turns this into a query-plan optimisation; DuckDB does it at scan time.

The 10–30× speedup of Polars / DuckDB over pandas on aggregate analytics queries is the product of these three factors, not the sum. The headline single-column-scan speedup the lab below measures runs higher (50–200×) because the cache-line-economy term dominates when the query touches one column out of many.

Zero-copy interchange — the cross-tool gift

Arrow's most underappreciated property: the buffers travel between languages and tools without serialisation.

Polars stores its DataFrame internally as Arrow buffers. polars.DataFrame.to_arrow() returns the same memory wrapped in pyarrow's Python API — no copy. df.to_pandas() hands the same int64 buffer to pandas as a numpy array — no copy for non-nullable numeric columns. Columns with a non-empty validity bitmap (any nulls) force a conversion pass into a float64 NaN-encoded numpy column; string, decimal, boolean, and list types also copy. The zero-copy property holds for the common analytics case of typed-primitive columns without nulls; nullable or variable-length columns pay a conversion.

DuckDB reads parquet into Arrow result sets. pl.from_arrow(duckdb.execute(...).arrow()) — no copy. Across processes, Arrow Flight (a gRPC-based wire protocol) sends the bytes that ARE the in-memory buffers — no serialisation pass.

This is the integration story modern analytics depends on. Pre-Arrow, pandas + numpy interchange existed but was duck-typed and fragile across versions. Arrow is the typed, specified, version-stable cross-tool form.

Parquet — same shape, disk edition

Parquet is the on-disk file format that mirrors Arrow's in-memory layout. A Parquet file stores:

Row groups (typically 128 MB each) — units of independent parallel reads.
Column chunks within a row group — each column compressed (Snappy, ZSTD, LZ4) and statistics-summarised (min, max, null count, distinct count).
Footer metadata at the file end naming where every chunk lives.

The mirror: an Arrow column buffer in memory is the decompressed form of a Parquet column chunk on disk. Reading 'just the 3 columns I need' is a Parquet primitive — the engine seeks to those column chunks and skips the rest. The same operation on CSV or fixed-width binary requires reading the whole file.

Polars, DuckDB, Spark, ClickHouse all read Parquet natively. Files written by one tool are readable by all others. The format is shared at three layers: shared on disk (Parquet), shared in memory (Arrow), shared on the wire (Arrow Flight).

When columnar does not fit

Three workloads where row-orientation still wins.

Point lookups by primary key. SELECT * FROM users WHERE id = 42. A row store does one disk seek for the whole row; a column store does N seeks (one per column). PostgreSQL row-orients its tables for this reason. Hydra adds columnar tables on top of Postgres for analytics; Citus shards row-oriented tables across nodes (with optional columnar storage via its cstore lineage). The split is OLTP-row + OLAP-column, often in the same cluster.

Frequent single-row updates. OLTP workloads. Inserting one row into a columnar table touches every column buffer. Most columnar tools handle inserts by buffering recent rows in row form (a 'delta store') and bulk-converting to columnar in the background — see Apache Druid's segment design, ClickHouse's MergeTree.

High-cardinality variable-length columns. Arrow's offsets buffer adds 4 or 8 bytes of overhead per element for variable-length types. A column of 100M unique strings has 800 MB of offsets alone before any string data. Dictionary encoding (Arrow's dictionary type) helps when cardinality is low; the overhead is real when cardinality matches row count.

Measure columnar advantage

Construct a 10M-row DataFrame with 10 numeric columns (col_0 through col_9, all float64, random values). Run two queries on the same data, in both pandas and Polars:

(a) Sum col_0 (single-column scan). (b) Compute col_0 + col_1 for every row (two-column scan).

Measure all four queries with timeit. Report the four numbers and the two speedups (Polars / pandas for each query). Explain in writing: (1) which query benefits more from columnar layout, (2) what the cache-line economy contributes to query (a)'s speedup specifically, (3) what would happen to the (b) speedup if you computed col_0 + col_1 + ... + col_9 instead (10-column sum).

Expected: Query (a) (one-column sum) benefits enormously from columnar — Polars typically 50–200× faster than pandas on this. Query (b) (two-column add) benefits less — both columns must be touched, so cache-line economy advantage shrinks (typically 5–20×). The pattern: more columns skipped → more columnar wins. For query (a), 9 of 10 columns are unread; bandwidth used is ~1/10 of row-stored. For (b), 8 of 10 unread. For a 10-column sum, all 10 columns are touched and the cache-line advantage disappears entirely — the Polars vs pandas gap shrinks to whatever SIMD + threading contributes (typically 2–4×).

Prerequisites

vectorization

Unlocks

Bridges

⇄
parquet-row-groupsshared mechanism
Parquet's row groups + column chunks are the same columnar shape at the file boundary that Arrow uses in memory. The decompressed Parquet column chunk IS the in-memory Arrow buffer.
Where this shows up
- Row group = horizontal partition; column chunk = one column within it
- Min/max statistics on the chunk let the reader skip whole row groups
- Decompressed column chunk is laid out as an Arrow-compatible buffer
cache-lineshared mechanism
Columnar layout's primary performance lever is cache reuse: a single column scan touches one stream of cache lines, not the entire row. A 64-byte cache line holds 8 int64s; a 10-column scan wastes 9/10 of bandwidth on row stores.

Done state

Evidence the learner produces, checks that confirm it.

Evidence

artifactMeasurement table from the lab — four timings (pandas/Polars × queries a/b) on the learner's own hardware, with the two computed speedup ratios.
observable behaviorPredicts whether a SQL or DataFrame query will benefit from columnar layout by counting (columns touched) / (columns total). The lower the ratio, the bigger the columnar win.

Checks

manualNames Arrow's three column primitives (typed buffer + validity bitmap + offsets for variable-length types) and gives the memory footprint of a 10M-row `int64` column with 1% nulls. Reference answer: 80 MB values buffer + 1.25 MB validity bitmap = ~81.25 MB total.
manualNames two cross-tool zero-copy paths Arrow enables. Reference: Polars ↔ pandas via shared numpy buffers; DuckDB → Polars via pyarrow result sets; Polars → pandas via `to_pandas()`; cross-process via Arrow Flight; cross-disk via Parquet.
manualNames a workload where row-oriented storage wins, with reasoning. Reference: point lookups (one row across all columns = 1 row-seek vs N column-seeks); single-row updates (touches N column buffers vs 1 row); OLTP in general.

References

Apache Arrow — Columnar Format Specificationspec · apache-arrow-format-spec
C-Store: A Column-oriented DBMS (Stonebraker et al., VLDB 2005)paper · stonebraker-c-store-2005
Database Internals (Alex Petrov, 2019) — Part II: Distributed Systemsbook · petrov-database-internals
Python for Data Analysis, 3rd ed., ch. 5 — Getting started with pandas (Wes McKinney)book · mckinney-python-data-analysis-ch5