Shared Counter

synchronization · L0 · atom

Replaces the belief that an increment on a shared integer is atomic, and that `volatile` makes it so.

Two threads, one `int64_t`, one `+1` each, one million times: the expected final value is 2,000,000; the observed value is closer to 520,000. The lost updates expose the read-modify-write window that all of synchronisation exists to close.

Bug shape

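Two threads each add 1 to the same `int64_t` one million times. Each increment is three steps: read the counter, add 1, write the result back. When both threads read the same value before either writes, both write the same result and one increment is lost. That read-modify-write window is why the observed total lands near 520,000 instead of 2,000,000.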

Pitfall demolished: the belief that an increment on a shared integer is atomic, and that `volatile` makes it so.

Setup: 4 threads · C++ · 1M iterations per thread · Apple M1 Pro
Per increment: 2.50 ns (median of 10 reps)
vs Non-Atomic: baseline, fast but wrong

Non-Atomic

int counter = 0;

void worker() {
    int local = counter;  // read
    local = local + 1;    // compute
    counter = local;      // write
}

Trace, step 1 of 6: T1 reads counter = 0 into its local variable. T2 has not read yet, and the shared counter is still 0.

Counter ended at 1,026,290 of 4,000,000 expected: 74% of writes vanished. The fix is an atomic or a mutex.

Fixes

Atomic closes the window in hardware: the increment becomes one indivisible read-modify-write instruction, `LOCK XADD` on x86 or `LDADD` on AArch64 with the LSE extension (ARMv8.1 and later; without LSE the compiler emits a load-exclusive/store-exclusive retry loop instead). Mutex closes the window in software: only one thread holds the lock at a time, so the read, compute, and write run as a sequence no other thread can interleave with. Both are correct fixes. The choice is not about correctness; it is about the latency profile under contention.

In Rust, the non-atomic version on this page uses `static mut`, which requires `unsafe`. Safe Rust rejects unsynchronized shared mutation across threads at compile time: two threads cannot both hold a `&mut` to the counter. It is the same bug shape the C++ version exposes at run time; Rust closes the window earlier in the toolchain.

Cost model

Approach	Correctness	Latency	Dominant cost
Non-Atomic	incorrect	~1 ns/op	Lost updates. Final count drifts proportional to contention.
Atomic (relaxed)	correct	~3–8 ns/op	Cache-line ping-pong between cores. Throughput collapses past 8 contending threads.
Mutex	correct	~25 ns/op uncontended	Kernel wait queue on contention. p99 tail latency spikes into microseconds.

Where this shows up in a trading system

Market Data Pipeline: Sequence-number counters on a feed handler track gaps. A non-atomic counter under-reports gaps; an atomic counter is correct and costs nanoseconds. Mutex is too slow for the hot path. Latency: atomic acceptable; mutex disqualifies.

Metrics System: Per-strategy fill counters and rejection counters are the textbook atomic case. A non-atomic increment silently under-reports events the same way the lost updates here under-report increments. Latency: p99.9 must stay stable, so atomic is the only option.

Matching Engine: Order IDs are monotonic sequence numbers. Atomic fetch-add is the standard generator. Lock-free queues for replication carry the same atomic primitive underneath.

Why `volatile` does not fix this

`volatile` instructs the compiler not to cache the variable in a register across uses: every access becomes a real load or store. It does not make the read-modify-write sequence indivisible. The instruction stream is still three steps, and another thread can still interleave between them. `volatile` is the right tool for memory-mapped I/O, not for thread safety.

Run it

Build and run the C++ benchmark: `pnpm bench:cpp`. Read the final counter value for each of the three approaches. Increase `N` and `thread_count` in `benchmarks/cpp/shared_counter.cpp` until the non-atomic race is observable on your machine.

Expected: for non-atomic, `final_count` is materially less than `thread_count * N` (typically 30–70% on Apple Silicon). For atomic and mutex, `final_count == thread_count * N` exactly.

Prerequisites (root concept)

Unlocks

race-condition
mutex
false-sharing
memory-ordering
atomic-rmw-primitives
lock-free-queue

Bridges

cache-coherence (shared mechanism)
Atomic RMW and mutex acquire both invalidate the cache line holding the counter. The contention cost in both cases is the same coherence traffic.
Where this shows up
- MESI invalidates the line on every write from another core
- False sharing benchmark: two threads, two ints in one 64-byte line
- M1 Pro L2 is shared across performance cores; cross-cluster traffic crosses the fabric

metrics-undercounting (shared failure mode)
A non-atomic increment in a metrics counter under-reports events the same way the lost updates here under-report increments.
Where this shows up
- Prometheus client counter using a plain `+=` under contention
- Lost increments scale with thread count and contention window length
- Counter total deviates from event-log row count by the dropped writes

Done state

Evidence the learner produces, checks that confirm it.

Evidence

artifact: Benchmark output capturing `final_count < thread_count * N` for the non-atomic run and `final_count == thread_count * N` for the atomic and mutex runs on the learner's own hardware.
observable behavior: Can articulate, in one sentence each: (1) why `volatile` does not fix the race, (2) why an atomic RMW with `memory_order_relaxed` is correct for a pure counter, (3) why mutex tail latency spikes under contention but atomic tail latency does not.

Checks

command · exit 0: `pnpm bench:cpp` expects output to include: non-atomic, atomic, mutex
trace · shared-counter/non-atomic: Identifies the step at which both threads hold the same value in their local register and neither has written back yet. This is the moment one of the two increments is guaranteed to vanish.
manual: Explains why `volatile int` does not eliminate the race without using the word 'compiler'. (Reference answer: the read-modify-write sequence is still three separate instructions at the machine level; `volatile` forces each access to go to memory, but it does nothing to stop another core's instructions from interleaving between them.)
