StateTrace
Visual Quant & Low-Latency Systems Lab

Global Interpreter Lock (Python)

runtime·L2 · idiom
Replaces: Python threads run CPU bytecode in parallel.

The GIL is a runtime-wide lock that serialises CPython bytecode execution while leaving blocking I/O syscalls and GIL-releasing native-extension code concurrent. CPU-bound multithreading in pure Python sees no speedup. I/O-bound work and NumPy-bound work can scale with thread count.

What the GIL is, and why

The Global Interpreter Lock is a single mutex inside CPython that serialises access to interpreter state. Only one OS thread executes Python bytecode at any moment. Threads exist; they take turns. The lock is held by the interpreter loop and released around blocking syscalls, long-running C extensions, and at the periodic switch interval (5 ms default, sys.getswitchinterval(), unchanged from Python 3.2 through 3.14).
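The switch interval mentioned above can be read and changed at runtime. A minimal sketch using only the `sys` module; the helper name `with_switch_interval` and the 1 ms value are illustrative assumptions, not a tuning recommendation:

```python
import sys

def with_switch_interval(seconds):
    """Temporarily set the switch interval; return (old, new) and restore the old value."""
    old = sys.getswitchinterval()
    sys.setswitchinterval(seconds)
    try:
        new = sys.getswitchinterval()
    finally:
        sys.setswitchinterval(old)  # always restore the previous interval
    return old, new

if __name__ == "__main__":
    old, new = with_switch_interval(0.001)
    print(f"default: {old * 1000:.1f} ms, temporary: {new * 1000:.1f} ms")
```

A shorter interval makes threads hand over the lock more often; it does not let two threads run bytecode at once.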

The reason: CPython manages memory by reference counting. Every PyObject has a refcount field that is incremented whenever a new reference to the object is created (assignment, argument passing, insertion into a container) and decremented whenever a reference is dropped. Without a global lock, every increment and decrement on a shared object would need an atomic operation. The GIL is the design choice to use one big lock instead of millions of small atomic ops: it trades CPU-bound thread parallelism for refcount-update simplicity. The trade was made in CPython's early threading design (early 1990s) and Python has lived with it ever since.
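The refcount traffic described above can be observed from Python itself. A small sketch, assuming CPython (other implementations do not use reference counting this way); `refcount_deltas` is a hypothetical helper name. It reports changes relative to a baseline, because `sys.getrefcount` includes temporary references of its own:

```python
import sys

def refcount_deltas():
    """Return refcount changes of one object relative to its baseline count."""
    obj = []                               # a fresh, ordinary object
    base = sys.getrefcount(obj)

    alias = obj                            # a second name: one more reference
    after_alias = sys.getrefcount(obj)

    container = [obj, obj]                 # two more references held by the list
    after_container = sys.getrefcount(obj)

    del alias                              # drop one reference
    container.clear()                      # drop the two held by the list
    after_release = sys.getrefcount(obj)

    return (after_alias - base, after_container - base, after_release - base)

if __name__ == "__main__":
    print(refcount_deltas())
```

Every one of these updates is a write to shared memory, which is why an interpreter without the GIL needs some other way to keep the count consistent across threads.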

CPU-bound — threading does not help

python
import threading
import time

def cpu_work(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

# One thread, 100M iterations.
start = time.perf_counter()
cpu_work(100_000_000)
print(f"one thread:  {time.perf_counter() - start:.2f}s")

# Two threads, 50M iterations each — same total work.
start = time.perf_counter()
t1 = threading.Thread(target=cpu_work, args=(50_000_000,))
t2 = threading.Thread(target=cpu_work, args=(50_000_000,))
t1.start(); t2.start()
t1.join(); t2.join()
print(f"two threads: {time.perf_counter() - start:.2f}s")

# Apple M1 Pro, CPython 3.13 (with GIL):
#   one thread:  3.2s
#   two threads: 3.4s    — slightly slower, not faster.
# The GIL serialises both threads onto effectively one core's progress.

When the GIL does not matter — I/O-bound and native code

The GIL is released around blocking system calls. time.sleep, socket.recv, file.read, HTTP libraries (requests, httpx), database drivers — all spend most of their wall time in C code waiting on the kernel. While one thread waits, another holds the GIL and progresses.

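The GIL is also released around long-running C extensions: numpy operations, pandas operations on numeric columns, sqlite calls, image processing, most scipy and scikit-learn calls. When the hot work happens in compiled C/C++/Rust code, the extension typically releases the GIL on entry and re-acquires it on return. The release is explicit, not automatic: native code that never releases the lock blocks other threads just as bytecode does. The standard-library re module is one example; it holds the GIL while matching, so regex over large strings does not parallelise across threads.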

This is the reason requests-based web scrapers scale almost linearly with threads (until the network or the remote server becomes the limit), numpy-heavy pipelines scale with cores, and Polars is faster than pandas at multi-threaded work. The GIL covers Python bytecode; it gets out of the way of blocking syscalls and of native code that releases it.
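A sketch of threads overlapping native work. It uses `hashlib` as a standard-library stand-in for the libraries named above (an assumption of this example, chosen because it needs no install); `hashlib` releases the GIL while hashing buffers larger than 2047 bytes. The helper names `digest` and `digest_all` are hypothetical:

```python
import hashlib
import time
from concurrent.futures import ThreadPoolExecutor

def digest(buf):
    # The hashing loop runs in C with the GIL released for large buffers.
    return hashlib.sha256(buf).hexdigest()

def digest_all(buffers, workers=4):
    """Hash every buffer on a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(digest, buffers))

if __name__ == "__main__":
    buffers = [bytes([i]) * (16 * 1024 * 1024) for i in range(4)]

    start = time.perf_counter()
    serial = [digest(b) for b in buffers]
    serial_s = time.perf_counter() - start

    start = time.perf_counter()
    threaded = digest_all(buffers)
    threaded_s = time.perf_counter() - start

    assert serial == threaded
    print(f"serial: {serial_s:.3f}s  threaded: {threaded_s:.3f}s")
```

On a multi-core machine the threaded run would be expected to finish sooner; the actual ratio depends on core count and buffer size, so measure rather than assume.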

I/O-bound — threading does help

python
import threading
import time

def io_work(n):
    for _ in range(n):
        time.sleep(0.1)    # releases the GIL during the sleep

# One thread, 200 sleeps.
start = time.perf_counter()
io_work(200)
print(f"one thread:  {time.perf_counter() - start:.2f}s")

# Two threads, 100 sleeps each.
start = time.perf_counter()
t1 = threading.Thread(target=io_work, args=(100,))
t2 = threading.Thread(target=io_work, args=(100,))
t1.start(); t2.start()
t1.join(); t2.join()
print(f"two threads: {time.perf_counter() - start:.2f}s")

# Apple M1 Pro, CPython 3.13:
#   one thread:  20.0s
#   two threads: 10.0s   — 2× speedup. The GIL releases during time.sleep.

Three escape hatches for CPU-bound work

multiprocessing. One Python interpreter per process, one GIL per interpreter. Use multiprocessing.Pool or concurrent.futures.ProcessPoolExecutor. The cost is process startup (~10–30 ms fork on Linux, ~100–300 ms spawn on macOS) and inter-process data marshalling (pickle round-trip). Worth it when the work is large and the data is small.
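A minimal sketch of the process-pool route, reusing the sum-of-squares loop from the CPU-bound demo. The names `sum_squares`, `split`, and `parallel_sum_squares` are hypothetical helpers, and the sketch assumes `n >= 1`. Each worker receives only a `(start, stop)` pair, which keeps the pickled payload small:

```python
from concurrent.futures import ProcessPoolExecutor

def sum_squares(bounds):
    """CPU-bound kernel: sum of i*i for i in [start, stop)."""
    start, stop = bounds
    total = 0
    for i in range(start, stop):
        total += i * i
    return total

def split(n, parts):
    """Split range(n) into up to `parts` contiguous (start, stop) pairs."""
    step = -(-n // parts)  # ceiling division
    return [(lo, min(lo + step, n)) for lo in range(0, n, step)]

def parallel_sum_squares(n, workers=4):
    """Run one chunk per worker process, each with its own interpreter and GIL."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum_squares, split(n, workers)))
```

Call `parallel_sum_squares(...)` from a script under an `if __name__ == "__main__":` guard: with the spawn start method each worker re-imports the main module, and the guard stops it from launching pools of its own.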

Native extensions that release the GIL. numpy, polars, pandas, scipy, scikit-learn, pytorch, sqlite. The hot path runs in C/C++/Rust, where the extension drops the GIL before the heavy work and re-acquires it before returning to Python. This is the path Stage 7 (Rust + PyO3) leads to: rewrite the CPU-bound kernel in Rust, expose it via PyO3, and release the GIL explicitly around the kernel. PyO3 does not drop the lock on its own; a native function that never releases it serialises threads just as pure Python does.

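PyPy. A different Python implementation with a tracing JIT. PyPy still has a GIL but often runs CPU-bound pure Python several times faster than CPython (the gain is workload-dependent and can be much larger on tight loops), sometimes eliminating the need for parallelism. The limit is C-extension ecosystem compatibility: extensions run through a compatibility layer that adds overhead, so NumPy and pandas work but are slow, and support for other packages such as PyTorch varies.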

PEP 703 — the GIL becomes optional

Python 3.13 ships an experimental free-threaded build (configure option --disable-gil; PEP 703 by Sam Gross, accepted 2023). It removes the global lock by replacing plain reference counting with biased reference counting and by adding per-object locking for mutable containers. Under biased reference counting, the thread that created an object owns it and updates a thread-local count without atomics; other threads update a separate shared count with atomic operations. The single-threaded cost is reported at roughly 5–10% in 3.14 and was noticeably higher in the experimental 3.13 build. Multi-threaded CPU-bound Python can finally scale with cores, provided the threads do not contend on the same objects.
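The split between an owner-side count and a shared count can be modelled in a few lines. This is a toy model only, not CPython's implementation: the class name `BiasedRefCount` is hypothetical, a `threading.Lock` stands in for an atomic add, and the real scheme's handling of counts reaching zero and of merging the two fields is omitted.

```python
import threading

class BiasedRefCount:
    """Toy model: the owning thread updates `local` without locking;
    every other thread updates `shared` under a lock (standing in for an atomic op)."""

    def __init__(self):
        self.owner = threading.get_ident()   # thread that created the object
        self.local = 1                       # the creating reference
        self.shared = 0
        self._shared_lock = threading.Lock()

    def incref(self):
        if threading.get_ident() == self.owner:
            self.local += 1                  # fast path: no synchronisation
        else:
            with self._shared_lock:          # slow path: synchronised update
                self.shared += 1

    def decref(self):
        if threading.get_ident() == self.owner:
            self.local -= 1
        else:
            with self._shared_lock:
                self.shared -= 1

    def total(self):
        return self.local + self.shared

if __name__ == "__main__":
    rc = BiasedRefCount()
    rc.incref()                              # owner thread: fast path
    other = threading.Thread(target=rc.incref)
    other.start()
    other.join()                             # non-owner thread: slow path
    print(rc.local, rc.shared, rc.total())
```

The bias is the point of the design: objects that never leave their creating thread pay no synchronisation cost, which is the common case in single-threaded code.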

Python 3.14 (October 2025) makes free-threaded mode officially supported (PEP 779), though it remains an optional build. The migration timeline: 3.15+ may make it the default. The long pole is the C-extension ecosystem: every extension must declare Py_MOD_GIL_NOT_USED to opt in, and importing one that does not causes the interpreter to re-enable the GIL at runtime. numpy, polars, and most major libraries are migrating; the long tail will take years.
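Whether a given interpreter is a free-threaded build, and whether the GIL is active right now, can be checked at runtime. A sketch assuming CPython; `gil_status` is a hypothetical helper name. `sys._is_gil_enabled()` exists from Python 3.13 onward, so the sketch falls back to "enabled" on older versions:

```python
import sys
import sysconfig

def gil_status():
    """Return (free_threaded_build, gil_enabled_now)."""
    # Py_GIL_DISABLED is 1 on free-threaded builds, 0 or None otherwise.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() was added in 3.13; older interpreters always have the GIL.
    probe = getattr(sys, "_is_gil_enabled", None)
    gil_enabled_now = probe() if probe is not None else True
    return free_threaded_build, gil_enabled_now

if __name__ == "__main__":
    build, enabled = gil_status()
    print(f"free-threaded build: {build}, GIL enabled now: {enabled}")
```

Running the check again after importing third-party extensions is a quick way to spot a module that has not opted in: on a free-threaded build the second value can flip back to enabled.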

Predict and measure

Three Python sketches below. For each, predict (a) whether the workload is CPU-bound or I/O-bound, (b) whether multithreading would help, (c) what the speedup ceiling is on a 4-core machine — and explain in one sentence per sketch.

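python
# Sketch 1: stdlib JSON parsing (process is a placeholder for per-record work)
import json
with open('big.jsonl') as f:
    for line in f:
        data = json.loads(line)
        process(data)

# Sketch 2: orjson + numpy pipeline (chunks is any iterable of JSON byte strings)
import orjson, numpy as np
for chunk in chunks:
    arr = np.array(orjson.loads(chunk))
    result = np.fft.fft(arr)

# Sketch 3: aiohttp-based scraper (urls is a list of URL strings)
import asyncio, aiohttp
async def fetch_all(urls):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*[session.get(u) for u in urls])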

Then construct a small benchmark for whichever sketch is closest to your real workload and measure on your own hardware. Compare your prediction to the measurement.

Expected:
Sketch 1: CPU-bound. The stdlib `json.loads` holds the GIL throughout, and `process` is Python code. Multithreading does not help; the speedup ceiling is ~1×. Use multiprocessing. Switching the parser to `orjson` lowers the per-line cost but does not add thread parallelism: it builds Python objects directly, so it holds the GIL while parsing.
Sketch 2: I/O at chunk-fetch, CPU at the work layer. `orjson.loads` and the `np.array` conversion of a Python list both hold the GIL; `np.fft.fft` runs in native code that can release it. Multithreading helps only for the FFT share of the time. Speedup ceiling on 4 cores: below 4×, set by how much of each chunk's time is spent outside the FFT.
Sketch 3: I/O-bound (network). `asyncio.gather` already issues requests concurrently within one thread; threading on top adds little. Speedup ceiling: limited by server-side parallelism and network round-trip count, not thread count.
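A possible starting point for the benchmark: a small harness that times the same function over the same items serially and on a thread pool. The names `time_serial`, `time_threaded`, and `fake_io` are hypothetical, and `fake_io` (a plain `time.sleep`) is only a stand-in to be replaced with the per-item work from whichever sketch matches the real workload.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def time_serial(fn, items):
    """Wall time to call fn on every item, one after another."""
    start = time.perf_counter()
    for item in items:
        fn(item)
    return time.perf_counter() - start

def time_threaded(fn, items, workers):
    """Wall time to call fn on every item using a thread pool."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(fn, items))  # consume results so every call completes
    return time.perf_counter() - start

def fake_io(delay):
    # Stand-in workload: sleeping releases the GIL, like a blocking syscall.
    time.sleep(delay)

if __name__ == "__main__":
    items = [0.05] * 8
    serial = time_serial(fake_io, items)
    threaded = time_threaded(fake_io, items, workers=4)
    print(f"serial: {serial:.2f}s  threaded: {threaded:.2f}s  "
          f"ratio: {serial / threaded:.1f}x")
```

A ratio near 1× with the real workload points to bytecode holding the GIL; a ratio approaching the worker count points to time spent in syscalls or GIL-releasing native code.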

Bridges
  • io-bound-vs-cpu-bound · shared failure mode
    GIL's cost is invisible for I/O-bound workloads and dominant for CPU-bound. The same workload-shape distinction reappears in async runtime design (Stage 7) and feed-handler architecture (Stage 5) — the question 'where does the time actually go' is the same question in three pillars.
    Where this shows up
    • `asyncio` serves 10k mostly idle sockets from a single core; one thread suffices
    • NumPy matmul can saturate all cores; confining it to one thread would be the bottleneck
    • The same code shape — `for x in items: work(x)` — has opposite scaling laws
  • free-threaded-cpython · implementation → model
    Python 3.13's free-threaded build (PEP 703) replaces a single global lock with per-object locking + biased reference counting. The lock isn't removed; it's distributed and skewed in favour of the thread that created each object.

Done state

Evidence the learner produces, checks that confirm it.

Evidence
  • artifact · Measurement on the learner's own hardware showing the CPU-bound demo runs in roughly the same time with one vs. two threads (or slightly slower with two), and the I/O-bound demo halves with two threads.
  • observable behavior · Predicts for an arbitrary Python code sketch whether the GIL blocks multithreaded speedup, by reasoning about (a) which library calls release the GIL and (b) whether the hot path is pure Python or a C extension.
Checks
  • manual · Predicts the GIL behaviour for the lab's three sketches and explains reasoning in one sentence each.
  • manual · Explains why pandas operations 'release the GIL' but pandas-heavy code is often still effectively single-threaded in practice. Reference answer: pandas releases the GIL inside individual ops, but per-op work is short and per-op thread-coordination overhead dominates; the per-op release does not aggregate into thread-level parallelism. The typical fix is `multiprocessing` or switching to Polars (which keeps the work inside the Rust core across operations).
  • manual · Explains PEP 703's mechanism without using the phrase 'remove the lock'. Reference answer: per-object locking + biased reference counting. The lock isn't removed; it's distributed across objects and skewed in favour of single-thread workloads via the bias.