ST
StateTrace
Visual Quant & Low-Latency Systems Lab
GitHub
Curriculum/vectorization

Vectorization

execution quality·L1 · combinator
Replacesthe belief that a faster Python loop is still a Python loop.

Vectorisation moves typed-array work into native kernels — one scalar operation over many elements becomes one C-level loop, with SIMD instructions multiplying lanes underneath. Speedup is 10–200× on numerical workloads. Object dtype, Python callbacks, and unvectorisable branching pull execution back to bytecode and the speedup vanishes.

The shape of the speedup

Vectorisation replaces a Python interpreter loop with a single call into a typed-array kernel. The kernel iterates in C (or in SIMD assembly underneath C); each scalar operation pays no per-iteration dispatch cost. The result is one of the largest reproducible speedups in numerical Python — 10–200× on workloads that map cleanly onto the typed-array model.

The unit of work changes. A Python loop's unit is one bytecode dispatch per element. A vectorised call's unit is one kernel invocation for the whole array. The setup cost is paid once; the per-element cost collapses by two orders of magnitude.

Same operation, three ways

python
import numpy as np

# 10M-element sum of squares.
xs = np.random.random(10_000_000).astype(np.float64)

def py_for(xs):
    total = 0.0
    for x in xs:
        total += x * x
    return total

def py_compr(xs):
    return sum(x * x for x in xs)

def np_vec(xs):
    return float(np.sum(xs * xs))    # element-wise multiply + reduce, both in C

# Apple M1 Pro, Python 3.12, numpy 2.0, 10M float64:
#   py_for      ~2200 ms       per-iteration interpreter dispatch
#   py_compr    ~1100 ms       bytecode-only, still dispatching
#   np_vec      ~12 ms         single C loop; SIMD + FMA inside
# Speedup: ~180× for np_vec vs py_for.

Two layers compose — library + SIMD

The first layer is library-internal C: the loop body runs in numpy's compiled C code, not in CPython's bytecode interpreter. Each iteration costs nanoseconds, not microseconds. This alone accounts for ~50× of the speedup in the example above.

The second layer is SIMD: numpy ships a runtime-dispatch kernel set; at import time it detects CPU features (AVX2, AVX-512 on x86; NEON on ARM) and picks the widest available kernel. On Apple M-series, NEON gives 2 lanes for float64 and 4 lanes for float32. On x86 with AVX-512, the lane count for float64 is 8.

The two layers multiply. Library-C alone is ~50×. Library + SIMD adds another 2–8× on top, depending on the lane count for the dtype. The remaining factor in the 10–200× total comes from FMA fusion and the tight straight-line loop structure inside the C kernel — on M-series float64 (2 SIMD lanes), library × SIMD = ~100× and FMA + loop structure account for the remaining ~1.8× up to the observed 180×.

Three ways vectorisation silently fails

Object dtype. A numpy array or pandas column with dtype=object stores Python references, not typed primitives. Every operation goes through CPython. Mixed-type columns (e.g. strings + None) degrade to object without warning; df.col operations then route through CPython at full per-element interpreter cost.

.apply() with a Python callback. df.col.apply(my_func) calls my_func once per row in pure Python. The vectorisation is gone; the cost goes back to interpreter dispatch per row. df.apply(func, axis=1) is the same trap at the DataFrame level.

Branching that cannot map to SIMD. A loop body that branches differently per element (if x > 0: ... else: ... in scalar code) cannot become straight-line SIMD. The numpy way is np.where(mask, then_value, else_value) — every lane evaluates both arms, then a mask picks the result. This is the right answer when both arms are cheap. When one arm is expensive (e.g. np.log(x) while the other returns 0), np.where still pays the expensive arm's cost on every element; np.select or boolean-mask assignment evaluates each arm only where its mask is true.

Each failure mode looks like fast code at the source level. The detection is measurement.

The .apply() trap, and the rewrite

python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price":  np.random.uniform(100, 110, 10_000_000),
    "shares": np.random.randint(1, 1000, 10_000_000),
    "side":   np.random.choice(["buy", "sell"], 10_000_000),
})

# Looks vectorised. Isn't.
def signed_notional(row):
    sign = 1 if row["side"] == "buy" else -1
    return sign * row["price"] * row["shares"]

df["notional_slow"] = df.apply(signed_notional, axis=1)     # ~9000 ms on M1 Pro

# Actually vectorised.
sign = np.where(df["side"] == "buy", 1, -1)
df["notional_fast"] = sign * df["price"] * df["shares"]      # ~45 ms on M1 Pro
# Speedup: ~200×. The apply hid a per-row Python callback.

Measurement signals

Vectorisation is one of the rare optimisations where the speedup is large enough to be obvious in casual timing. Wall-clock per element drops from microseconds to nanoseconds. A timeit over 10M elements moving from seconds to tens of milliseconds is the typical signature.

When the speedup is not present, the cause is one of the three failure modes. The fastest detection is cProfile over the pipeline — a Python function still being called per row appears as a tall bar with the row count as its call count. line_profiler on the suspect function names the row-callback culprit precisely.

The deeper diagnostic is df.dtypes — anything object is a vectorisation killer. df.info() shows column dtypes and memory use. Numerical data stored as object registers as a Python-reference column; operations then route through CPython at full per-element cost. The fix is to convert to a typed dtype (float64, int32, category, string[pyarrow]) at the point of ingestion.

Find and fix the hidden non-vectorisation

Look at the .apply() trap code above. Three tasks:

(a) Identify the reason the slow version is slow — which of the three failure modes from the previous section is hitting?

(b) Rewrite the slow version using broadcasting (no .apply, no Python callback). Measure the speedup on your own hardware with timeit over the same 10M-row DataFrame.

(c) Take the side column above and leave it as object dtype (it already is). Run df[df['side'] == 'buy'] and measure with timeit. Then convert via df['side'] = df['side'].astype('category') and measure again. Report both numbers.

Expected: (a) `.apply(axis=1)` calls `signed_notional` once per row in pure Python — failure mode #2. (b) `np.where(df['side'] == 'buy', 1, -1) * df['price'] * df['shares']` runs in tens of milliseconds vs. seconds for the apply version; on Apple M1 Pro, 10M rows, the speedup is roughly 200×. (c) The `object` column registers as `dtype: object`. Converting to `category` or `string[pyarrow]` recovers vectorised comparisons; on a 10M-row string equality (`df.side == 'buy'`), the speedup from object→pyarrow is typically 5–20×.

Prerequisites(root concept)
Bridges
  • simd-lanesmodel → implementation
    One scalar expression over many typed elements becomes SIMD work when the kernel owns the loop. SIMD lanes are the hardware mechanism; vectorisation is the API surface that exposes them through numpy and pandas.
    Where this shows up
    • AVX2 256-bit register holds 8 int32 lanes; AVX-512 holds 16
    • Polars sum kernel runs the inner loop in vectorised Rust over Arrow buffers
    • Lane count is the unit of speedup the compiler exposes to the kernel
  • numerical-stabilityshared failure mode
    Vectorised reductions can produce different numerical results than the equivalent scalar loop because operand order changes. A Stage 3 problem the textbooks rarely warn about — the same Sharpe ratio over the same returns can disagree at the third decimal between scalar and vectorised code.
Done state

Evidence the learner produces, checks that confirm it.

Evidence
  • artifactMeasurement on the learner's own hardware showing 100× or better speedup on the lab's `.apply` → broadcast rewrite. The before/after timings recorded.
  • observable behaviorNames the three vectorisation killers (object dtype, `.apply` with Python callback, unvectorisable branching) and gives a fix for each. Can spot `df.apply(..., axis=1)` in a code review as a likely vectorisation killer without running the code.
Checks
  • manualExplains the two layers of speedup (library-internal C + SIMD) and gives a number for each. Reference answer: ~50× from library-C alone, then 2–8× from SIMD on top depending on the dtype's lane count.
  • manualGiven an arbitrary pandas pipeline of 4–6 operations, identifies which steps are vectorised and which use a Python callback. Bonus: gives the rewrite to vectorise the callback step.
  • manualExplains why vectorised reductions can produce different floating-point results than the equivalent scalar loop on the same input. Reference answer: operand order changes when SIMD lanes accumulate partial sums in parallel; floating-point addition is not associative.
References