StateTrace
Visual Quant & Low-Latency Systems Lab

Shared Counter

synchronization · L0 · atom
Replaces the belief that an increment on a shared integer is atomic, and that `volatile` makes it so.

Two threads, one `int64_t`, one `+1` each, one million times: the expected final value is 2,000,000; the observed value is closer to 520,000. The lost updates expose the read-modify-write window that all of synchronisation exists to close.

Bug shape

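Two threads each add 1 to the same `int64_t` one million times. The expected final value is 2,000,000; the observed value is closer to 520,000. Each missing increment was lost inside the read-modify-write window: one thread read the counter, another thread wrote a newer value, and the first thread then wrote back its stale value plus one.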
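The lost update can be replayed deterministically. The sketch below is an illustration only, not the page's trace: `SimThread`, `lost_update_trace`, and `serialized_trace` are hypothetical names, and the six steps are one possible interleaving, run on a single real thread so the outcome never varies.

```cpp
#include <cassert>
#include <cstdint>

// One simulated thread: a private "register" plus the three steps of an increment.
struct SimThread {
    std::int64_t local = 0;
    void read(const std::int64_t& shared) { local = shared; }   // load
    void compute() { local = local + 1; }                       // add
    void write(std::int64_t& shared) const { shared = local; }  // store
};

// Replays one interleaving in which both threads read before either writes.
// Runs on a single real thread, so the result is deterministic.
std::int64_t lost_update_trace() {
    std::int64_t counter = 0;
    SimThread t1, t2;
    t1.read(counter);    // step 1: T1 local = 0
    t2.read(counter);    // step 2: T2 local = 0, both now hold the same value
    t1.compute();        // step 3: T1 local = 1
    t2.compute();        // step 4: T2 local = 1
    t1.write(counter);   // step 5: counter = 1
    t2.write(counter);   // step 6: counter = 1 again, T1's increment is overwritten
    assert(t1.local == t2.local);  // the signature of a lost update
    return counter;
}

// The same six steps with each thread's read, compute, write kept together.
std::int64_t serialized_trace() {
    std::int64_t counter = 0;
    SimThread t1, t2;
    t1.read(counter); t1.compute(); t1.write(counter);  // counter = 1
    t2.read(counter); t2.compute(); t2.write(counter);  // counter = 2
    return counter;
}
```

Two increments ran in both functions, yet `lost_update_trace()` returns 1 while `serialized_trace()` returns 2. The only difference is the order of the steps.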

Pitfall demolished: the belief that an increment on a shared integer is atomic, and that `volatile` makes it so.

Interactive trace, step 1 of 6: T1 reads counter = 0 into its local variable. The shared counter is still 0.

Observed run (4 threads, C++, 1M iterations per thread, Apple M1 Pro):
  • Per increment: 2.50 ns (median of 10 reps)
  • Correctness: 74% lost (1,026,290 / 4,000,000)
  • vs Non-Atomic: baseline, fast but wrong

Non-Atomic

    int counter = 0;  // shared by every thread, no synchronization

    void worker() {
        int local = counter;  // read
        local = local + 1;    // compute
        counter = local;      // write (may overwrite another thread's increment)
    }
The counter ended at 1,026,290 of the 4,000,000 expected: 74% of writes vanished. The fix is one of the approaches below.

Fixes

Atomic closes the window in hardware: a single `LOCK XADD` on x86, or `LDADD` on AArch64 when the LSE atomic instructions are available (Armv8.1 and later; older cores fall back to an `LDXR`/`STXR` retry loop). The increment is one indivisible operation.

Mutex closes the window in software: only one thread holds the lock at a time, so the read, compute, and write run as a sequence no other thread can interleave with.

Both are correct fixes. The choice is not about correctness; it is about latency profile under contention.
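A minimal sketch of both fixes, assuming standard C++ threads. The helper names `run_workers`, `count_atomic`, and `count_mutex` are hypothetical and are not taken from the page's benchmark source.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Runs `threads` workers that each call `increment` `iters` times, then joins them.
template <typename Increment>
void run_workers(int threads, std::int64_t iters, Increment increment) {
    assert(threads > 0 && iters >= 0);
    std::vector<std::thread> pool;
    pool.reserve(static_cast<std::size_t>(threads));
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([iters, &increment] {
            for (std::int64_t i = 0; i < iters; ++i) increment();
        });
    }
    for (auto& th : pool) th.join();
}

// Fix 1: the increment is a single atomic read-modify-write.
std::int64_t count_atomic(int threads, std::int64_t iters) {
    std::atomic<std::int64_t> counter{0};
    run_workers(threads, iters, [&counter] {
        counter.fetch_add(1, std::memory_order_relaxed);  // indivisible +1
    });
    return counter.load();  // join() above orders this read after every increment
}

// Fix 2: the three steps stay separate, but only the lock holder may run them.
std::int64_t count_mutex(int threads, std::int64_t iters) {
    std::int64_t counter = 0;
    std::mutex m;
    run_workers(threads, iters, [&counter, &m] {
        std::lock_guard<std::mutex> guard(m);
        counter = counter + 1;  // read, compute, write under the lock
    });
    return counter;
}
```

Both functions should return exactly `threads * iters`. `memory_order_relaxed` is enough here because nothing else is published through the counter; the final value is only read after the workers have been joined.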

In Rust, the non-atomic version on this page is a `static mut`, and every access to it requires `unsafe`. In safe code the borrow checker rejects aliased `&mut` references across threads at compile time, which is the same window the C++ version exposes at run time. Same bug shape; Rust closes the window earlier in the toolchain.

Cost model

| Approach | Correctness | Latency | Dominant cost |
| --- | --- | --- | --- |
| Non-Atomic | incorrect | ~1 ns/op | Lost updates. Final count drifts in proportion to contention. |
| Atomic (relaxed) | correct | ~3–8 ns/op | Cache-line ping-pong between cores. Throughput collapses past 8 contending threads. |
| Mutex | correct | ~25 ns/op uncontended | Kernel wait queue on contention. p99 tail latency spikes into microseconds. |
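One way a per-operation figure like those in the table could be derived is wall-clock time divided by total increments. The sketch below assumes that definition; `Sample`, `measure`, `measure_atomic`, and `measure_mutex` are hypothetical names, thread start-up time is included in the measurement, and the numbers it produces will differ from the table on other hardware.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

struct Sample {
    std::int64_t final_count;  // counter value after all threads have joined
    double ns_per_op;          // wall-clock nanoseconds divided by total increments
};

// Times `threads` workers calling `increment` `iters` times each; `read` fetches the total.
template <typename Increment, typename Read>
Sample measure(int threads, std::int64_t iters, Increment increment, Read read) {
    assert(threads > 0 && iters > 0);
    const auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (std::int64_t i = 0; i < iters; ++i) increment();
        });
    }
    for (auto& th : pool) th.join();
    const std::chrono::duration<double, std::nano> elapsed =
        std::chrono::steady_clock::now() - start;
    const double total_ops = static_cast<double>(threads) * static_cast<double>(iters);
    return {read(), elapsed.count() / total_ops};
}

Sample measure_atomic(int threads, std::int64_t iters) {
    std::atomic<std::int64_t> c{0};
    return measure(threads, iters,
                   [&c] { c.fetch_add(1, std::memory_order_relaxed); },
                   [&c] { return c.load(); });
}

Sample measure_mutex(int threads, std::int64_t iters) {
    std::int64_t c = 0;
    std::mutex m;
    return measure(threads, iters,
                   [&] { std::lock_guard<std::mutex> g(m); ++c; },
                   [&] { return c; });
}
```

Running both with the same `threads` and `iters` and comparing `ns_per_op` as `threads` grows is a rough way to see the contention behaviour the table describes. A single run is noisy; taking the median over several repetitions is the safer reading.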

Where this shows up in a trading system

  • Market Data Pipeline: Sequence-number counters on a feed handler track gaps. A non-atomic counter under-reports gaps; an atomic counter is correct and costs nanoseconds. Mutex is too slow for the hot path. (latency: atomic acceptable; mutex disqualified)
  • Metrics System: Per-strategy fill counters and rejection counters are the textbook atomic case. A non-atomic increment silently under-reports events the same way the lost updates here under-report increments. (latency: p99.9 must stay stable; atomic is the only option)
  • Matching Engine: Order IDs are monotonic sequence numbers. Atomic fetch-add is the standard generator. Lock-free queues for replication carry the same atomic primitive underneath.
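The fetch-add generator mentioned for the matching engine can be sketched as follows. `OrderIdGenerator` and `distinct_ids` are hypothetical names for illustration, and starting the sequence at 1 is an assumption.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <thread>
#include <vector>

// Hypothetical generator: hands out 1, 2, 3, ... with no duplicates across threads.
class OrderIdGenerator {
public:
    std::uint64_t next() {
        // fetch_add returns the value held before the add, so every caller gets a distinct ID.
        return next_id_.fetch_add(1, std::memory_order_relaxed);
    }

private:
    std::atomic<std::uint64_t> next_id_{1};
};

// Draws `per_thread` IDs on each of `threads` threads and counts the distinct IDs returned.
std::size_t distinct_ids(std::size_t threads, std::size_t per_thread) {
    assert(threads > 0 && per_thread > 0);
    OrderIdGenerator gen;
    std::vector<std::vector<std::uint64_t>> drawn(threads);  // one private slot per thread
    std::vector<std::thread> pool;
    for (std::size_t t = 0; t < threads; ++t) {
        pool.emplace_back([&gen, &drawn, t, per_thread] {
            for (std::size_t i = 0; i < per_thread; ++i) drawn[t].push_back(gen.next());
        });
    }
    for (auto& th : pool) th.join();
    std::set<std::uint64_t> unique;
    for (const auto& ids : drawn) unique.insert(ids.begin(), ids.end());
    return unique.size();
}
```

If the generator is correct, `distinct_ids(threads, per_thread)` equals `threads * per_thread`. Relaxed ordering guarantees unique, increasing IDs, but it does not order the ID against other memory the thread writes; a design that publishes data alongside the ID would need stronger ordering.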

Why `volatile` does not fix this

`volatile` instructs the compiler not to cache the variable in a register across uses, so every read and write becomes an actual memory access. It does not make the read-modify-write sequence indivisible. The instruction stream is still three steps, and another thread can still interleave between them. `volatile` is the right tool for memory-mapped I/O, not for thread safety.
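A sketch that makes the point observable, assuming a multi-core machine. It is deliberately racy: unsynchronized access to a `volatile` object from several threads is a data race, which the C++ standard treats as undefined behaviour, so this is for demonstration only. `g_counter` and `count_volatile` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// DELIBERATELY RACY, demonstration only.
// volatile forces each load and store to be issued; it does not fuse them into one step.
volatile std::int64_t g_counter = 0;

std::int64_t count_volatile(int threads, std::int64_t iters) {
    assert(threads > 0 && iters >= 0);
    g_counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([iters] {
            for (std::int64_t i = 0; i < iters; ++i) {
                std::int64_t local = g_counter;  // volatile read
                g_counter = local + 1;           // volatile write, window open in between
            }
        });
    }
    for (auto& th : pool) th.join();
    return g_counter;
}
```

On typical multi-core hardware the returned value falls short of `threads * iters`, by an amount that varies from run to run. It never exceeds that product, because every store writes a value one greater than something previously read.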

Run it

Build and run the C++ benchmark: `pnpm bench:cpp`. Read the final counter value for each of the three approaches. Increase `N` and `thread_count` in `benchmarks/cpp/shared_counter.cpp` until the non-atomic race is observable on your machine.

Expected:
  • Non-atomic: `final_count` is materially less than `thread_count * N` (typically 30–70% on Apple Silicon).
  • Atomic and mutex: `final_count == thread_count * N` exactly.

Prerequisites (root concept)
Unlocks
Bridges
  • cache-coherence · shared mechanism
    Atomic RMW and mutex acquire both invalidate the cache line holding the counter. The contention cost in both cases is the same coherence traffic.
    Where this shows up
    • MESI invalidates the line on every write from another core
    • False sharing benchmark: two threads, two ints in one 64-byte line
    • M1 Pro L2 is shared across performance cores; cross-cluster traffic crosses the fabric
  • metrics-undercounting · shared failure mode
    A non-atomic increment in a metrics counter under-reports events the same way the lost updates here under-report increments.
    Where this shows up
    • Prometheus client counter using a plain `+=` under contention
    • Lost increments scale with thread count and contention window length
    • Counter total deviates from event-log row count by the dropped writes
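The false-sharing case listed under cache-coherence (two threads, two integers, one cache line) can be sketched as below. This is an illustration under stated assumptions: `Packed`, `Padded`, `kLinePad`, and `ns_per_increment` are hypothetical names, and the 128-byte padding is an assumed value chosen to cover both 64-byte lines and the larger line size reported on Apple silicon. Relaxed atomics are used instead of plain `int`s so the compiler keeps one memory operation per iteration.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <thread>

constexpr std::size_t kLinePad = 128;  // assumption: at least one full cache line

// Both counters sit next to each other, so they share a cache line.
struct alignas(16) Packed {
    std::atomic<std::int64_t> a{0};
    std::atomic<std::int64_t> b{0};
};

// Each counter starts on its own kLinePad boundary, so they cannot share a line.
struct Padded {
    alignas(kLinePad) std::atomic<std::int64_t> a{0};
    alignas(kLinePad) std::atomic<std::int64_t> b{0};
};
static_assert(sizeof(Padded) >= 2 * kLinePad, "counters must be at least one pad apart");

// Two threads, each incrementing only its own counter; returns wall-clock ns per increment.
template <typename Pair>
double ns_per_increment(std::int64_t iters) {
    assert(iters > 0);
    Pair p;
    const auto start = std::chrono::steady_clock::now();
    std::thread t1([&p, iters] {
        for (std::int64_t i = 0; i < iters; ++i) p.a.fetch_add(1, std::memory_order_relaxed);
    });
    std::thread t2([&p, iters] {
        for (std::int64_t i = 0; i < iters; ++i) p.b.fetch_add(1, std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    const std::chrono::duration<double, std::nano> elapsed =
        std::chrono::steady_clock::now() - start;
    assert(p.a.load() == iters && p.b.load() == iters);  // no logical sharing, no lost updates
    return elapsed.count() / (2.0 * static_cast<double>(iters));
}
```

Comparing `ns_per_increment<Packed>(n)` with `ns_per_increment<Padded>(n)` for a large `n` gives a rough view of the coherence traffic: the threads never touch the same variable, so any gap between the two figures comes from sharing the line. The size of the gap depends on the machine and on which cores the threads land on.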
Done state

Evidence the learner produces, checks that confirm it.

Evidence
  • artifact · Benchmark output capturing `final_count < 2 * N` for the non-atomic run and `final_count == 2 * N` for the atomic and mutex runs on the learner's own hardware.
  • observable behavior · Can articulate, in one sentence each: (1) why `volatile` does not fix the race, (2) why an atomic RMW with `memory_order_relaxed` is correct for a pure counter, (3) why mutex tail latency spikes under contention but atomic tail latency does not.
Checks
  • command · exit 0 · `pnpm bench:cpp` · expects output to include: non-atomic, atomic, mutex
  • trace · shared-counter/non-atomic · Identifies the step at which both threads hold the same value in their local register and neither has written back yet. This is the moment one of the two increments is guaranteed to vanish.
  • manual · Explains why `volatile int` does not eliminate the race without using the word 'compiler'. (Reference answer: the read-modify-write sequence is still three separate instructions at the machine level; `volatile` only forces each load and store to be issued, and does nothing to stop the hardware from interleaving another thread's instructions between them.)