StateTrace
Visual Quant & Low-Latency Systems Lab

Shared Counter · Compare

All three approaches side by side. Tradeoffs between correctness, latency, and where each one lands inside a real trading system.

Side-by-side comparison
| Approach | Correctness | Speed | Latency | p99 | Resource | Best use |
|---|---|---|---|---|---|---|
| Non-Atomic | incorrect | Very Fast | ns | n/a | low | Demo only. |
| Atomic | correct | Fast | ns → μs | stable | low | Counters, flags, sequence numbers. |
| Mutex | correct | Slower | μs → ms | can spike | medium | Multi-field critical sections, non-hot-path state. |

Performance summary

Non-Atomic

Correctness: incorrect
Speed: Very Fast
Latency: ns
p99: n/a (incorrect)
Resource: low
Best for: Demo only.

Lowest overhead, but lost updates make it unusable.

Atomic

Correctness: correct
Speed: Fast
Latency: ns → μs
p99: stable
Resource: low
Best for: Counters, flags, sequence numbers.

Around 3x faster than mutex under low contention; degrades as contention grows.

Mutex

Correctness: correct
Speed: Slower
Latency: μs → ms
p99: can spike
Resource: medium
Best for: Multi-field critical sections, non-hot-path state.

Predictable under low contention; tail latency spikes under load.

Real-world impact

Where each approach lands in a real trading system. Hot-path fit says whether to reach for it on the critical path.

Non-Atomic
Matching Engine · avoid on hot path

Order books rely on counters that must reach the same value on every run. An unsynchronized counter does not: on an Apple M1 Pro it loses 74% of writes under 4-thread contention.

Example

An order-sequence counter incremented without synchronization on every accept produces gaps and duplicates. Audit and replay both break.
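For contrast, here is a minimal sketch of the synchronized version of that sequence counter. `SequenceAllocator` and `sequences_are_gap_free` are hypothetical names introduced for illustration, and the thread and iteration counts are arbitrary. The sketch hands out numbers with `fetch_add` and then checks that the combined result has no gaps or duplicates.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical order-sequence allocator: every accept takes the next number.
class SequenceAllocator {
public:
    // fetch_add is one indivisible read-modify-write, so two callers can
    // never receive the same value, whatever the memory ordering.
    std::uint64_t next() {
        return counter_.fetch_add(1, std::memory_order_relaxed);
    }

private:
    std::atomic<std::uint64_t> counter_{0};
};

// Runs `threads` workers that each take `per_thread` sequence numbers and
// returns true when the union of all numbers is exactly 0..N-1.
bool sequences_are_gap_free(int threads, int per_thread) {
    SequenceAllocator alloc;
    std::vector<std::vector<std::uint64_t>> seen(threads);
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&alloc, &seen, t, per_thread] {
            seen[t].reserve(per_thread);
            for (int i = 0; i < per_thread; ++i) {
                seen[t].push_back(alloc.next());
            }
        });
    }
    for (auto& w : workers) w.join();

    // Merge and sort; a gap or duplicate shows up as all[i] != i.
    std::vector<std::uint64_t> all;
    for (const auto& v : seen) all.insert(all.end(), v.begin(), v.end());
    std::sort(all.begin(), all.end());
    for (std::size_t i = 0; i < all.size(); ++i) {
        if (all[i] != i) return false;
    }
    return all.size() == static_cast<std::size_t>(threads) * per_thread;
}
```

Calling `sequences_are_gap_free(4, 10000)` should return `true`. Replacing the atomic with a plain `std::uint64_t` would make the program a data race, which is undefined behavior in C++ and in practice shows up as the gaps and duplicates described above.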

Metrics System · avoid on hot path

A counter that drops 74% of increments under contention reports only 26% of its true rate, roughly a quarter. The dashboard reads calm during the actual incident.

Example

Request counters, drop counters, retry counters: anything mutated by more than one thread without synchronization undercounts in the same way.

Atomic
Market Data Pipeline · good fit

An SPSC ring buffer's head and tail are atomics: the producer stores its cursor with release after writing a slot, and the consumer loads that cursor with acquire before reading it. The Disruptor design is the canonical reference.

Example

`head.store(next, release)` on publish; `head.load(acquire)` on consume. Relaxed ordering on the counter, ordered handoff on the cursor.
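The release/acquire handoff can be sketched as a small bounded SPSC ring. This is an illustrative sketch, not the Disruptor implementation: `SpscRing`, `try_push`, `try_pop`, and `transfer_in_order` are hypothetical names, the capacity of 64 is arbitrary, and the code assumes C++17 for `std::optional`.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <thread>

template <typename T, std::size_t Capacity>
class SpscRing {
    static_assert((Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of two");

public:
    // Producer thread only.
    bool try_push(const T& value) {
        const std::size_t head = head_.load(std::memory_order_relaxed);  // own cursor
        const std::size_t tail = tail_.load(std::memory_order_acquire);  // consumer progress
        if (head - tail == Capacity) return false;                       // ring is full
        slots_[head & (Capacity - 1)] = value;
        head_.store(head + 1, std::memory_order_release);  // publish the slot
        return true;
    }

    // Consumer thread only.
    std::optional<T> try_pop() {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);  // own cursor
        const std::size_t head = head_.load(std::memory_order_acquire);  // producer progress
        if (tail == head) return std::nullopt;                           // ring is empty
        T value = slots_[tail & (Capacity - 1)];
        tail_.store(tail + 1, std::memory_order_release);  // hand the slot back
        return value;
    }

private:
    std::array<T, Capacity> slots_{};
    std::atomic<std::size_t> head_{0};  // next slot to write
    std::atomic<std::size_t> tail_{0};  // next slot to read
};

// Sends 0..count-1 through the ring and returns the sum the consumer saw,
// or -1 if any value arrived out of order.
std::int64_t transfer_in_order(int count) {
    SpscRing<int, 64> ring;
    std::int64_t sum = 0;
    bool ordered = true;
    std::thread consumer([&] {
        for (int expected = 0; expected < count;) {
            if (auto v = ring.try_pop()) {
                if (*v != expected) ordered = false;
                sum += *v;
                ++expected;
            } else {
                std::this_thread::yield();  // nothing published yet
            }
        }
    });
    for (int i = 0; i < count; ++i) {
        while (!ring.try_push(i)) std::this_thread::yield();  // wait for space
    }
    consumer.join();
    return ordered ? sum : -1;
}
```

Each cursor has exactly one writer, so a thread can read its own cursor with relaxed ordering. Only the cross-thread read needs acquire, paired with the other side's release store.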

Metrics System · good fit

High-frequency counters use `fetch_add(1, relaxed)`. The cost is ~50 ns per increment on M1 Pro — fine for anything under ~10 M ops/s per core.

Example

Packets received, messages published, rejects, acknowledgements — one counter per thread aggregated periodically beats a single contended counter past a few hundred thousand ops/s.
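The per-thread pattern might look like the sketch below. `PaddedCounter`, `ShardedCounter`, and `count_with_shards` are hypothetical names, and the 128-byte alignment is an assumption about cache-line size chosen to avoid false sharing. The code assumes C++17 so that `std::vector` honors the over-alignment.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// One slot per thread, padded so that neighbouring slots do not share a
// cache line. 128 bytes is an assumed line size; adjust for the target CPU.
struct alignas(128) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
};

class ShardedCounter {
public:
    explicit ShardedCounter(std::size_t shards) : slots_(shards) {}

    // Hot path: each thread touches only its own slot, so there is no
    // contention between writers.
    void increment(std::size_t shard) {
        slots_[shard].value.fetch_add(1, std::memory_order_relaxed);
    }

    // Cold path: the aggregator sums all slots periodically. The result is
    // a snapshot and may lag increments that are still in flight.
    std::uint64_t total() const {
        std::uint64_t sum = 0;
        for (const auto& s : slots_) {
            sum += s.value.load(std::memory_order_relaxed);
        }
        return sum;
    }

private:
    std::vector<PaddedCounter> slots_;
};

// Runs `threads` workers, each incrementing its own shard `per_thread`
// times, and returns the aggregated total after all workers have joined.
std::uint64_t count_with_shards(std::size_t threads, std::uint64_t per_thread) {
    ShardedCounter counter(threads);
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < threads; ++t) {
        workers.emplace_back([&counter, t, per_thread] {
            for (std::uint64_t i = 0; i < per_thread; ++i) counter.increment(t);
        });
    }
    for (auto& w : workers) w.join();
    return counter.total();
}
```

After the workers have joined, `count_with_shards(4, 100000)` returns exactly 400000. The trade is memory (one padded slot per thread) for the removal of cross-core traffic on every increment.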

Execution Gateway · good fit

A single-writer atomic `bool is_paused` lets the hot path branch in a few cycles. Compare-and-swap is reserved for the rare state transition.

Example

Feed handler sets `is_paused.store(true, release)`. The order sender reads `is_paused.load(acquire)` once per loop iteration. Zero locks on the hot path.
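A minimal sketch of that flag, assuming one sender loop and one controlling thread. `ExecutionGateway`, `send_orders`, and `run_until_paused` are hypothetical names, and the "send" is a stand-in counter increment rather than real I/O.

```cpp
#include <atomic>
#include <cstdint>
#include <limits>
#include <thread>

class ExecutionGateway {
public:
    // Called by the controlling thread (the feed handler in the text).
    void pause() { is_paused_.store(true, std::memory_order_release); }
    void resume() { is_paused_.store(false, std::memory_order_release); }

    // Order-sender loop: checks the flag once per iteration and stops as
    // soon as it observes a pause. Returns how many orders were "sent".
    std::uint64_t send_orders(std::uint64_t max_orders) {
        std::uint64_t sent = 0;
        while (sent < max_orders) {
            if (is_paused_.load(std::memory_order_acquire)) break;
            ++sent;  // stand-in for handing one order to the wire
        }
        return sent;
    }

private:
    std::atomic<bool> is_paused_{false};
};

// Starts a sender with an effectively unbounded budget, pauses it from the
// calling thread, and returns how many orders went out before it stopped.
std::uint64_t run_until_paused(ExecutionGateway& gw) {
    std::uint64_t sent = 0;
    std::thread sender([&] {
        sent = gw.send_orders(std::numeric_limits<std::uint64_t>::max());
    });
    gw.pause();
    sender.join();
    return sent;
}
```

The number returned by `run_until_paused` varies from run to run, since it depends on how quickly the sender observes the store. What the flag guarantees is that the loop stops, and that everything written before `pause()` is visible to the sender once it sees `true`.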

Mutex
Risk Engine · depends

A mutex protects a multi-field state update that must move together — position + last-trade timestamp + exposure cap. Atomics don't compose across fields.

Example

`std::lock_guard<std::mutex> g(state_mu); position += qty; last_trade = now; cap_used += notional;` Any reader that takes the same lock sees all three writes or none of them.
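A fuller sketch of the same idea, with the reader side included. `RiskState`, `RiskSnapshot`, `on_fill`, and `snapshots_stay_consistent` are hypothetical names, and the fixed quantities (1 unit and 10 notional per fill) exist only so that a torn read would be detectable.

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>
#include <thread>

struct RiskSnapshot {
    std::int64_t position;
    std::int64_t last_trade_ns;
    std::int64_t cap_used;
};

class RiskState {
public:
    // Writer: all three fields change under one lock.
    void on_fill(std::int64_t qty, std::int64_t now_ns, std::int64_t notional) {
        std::lock_guard<std::mutex> g(state_mu_);
        position_ += qty;
        last_trade_ns_ = now_ns;
        cap_used_ += notional;
    }

    // Reader: copies all three fields under the same lock, so the copy can
    // never mix values from before and after a fill.
    RiskSnapshot snapshot() const {
        std::lock_guard<std::mutex> g(state_mu_);
        return RiskSnapshot{position_, last_trade_ns_, cap_used_};
    }

private:
    mutable std::mutex state_mu_;
    std::int64_t position_ = 0;
    std::int64_t last_trade_ns_ = 0;
    std::int64_t cap_used_ = 0;
};

// Applies `fills` fills of 1 unit / 10 notional while a reader thread checks
// that every snapshot satisfies cap_used == 10 * position and
// last_trade_ns == position (the fill index doubles as the timestamp).
bool snapshots_stay_consistent(int fills) {
    RiskState state;
    std::atomic<bool> done{false};
    bool consistent = true;
    std::thread reader([&] {
        while (!done.load(std::memory_order_acquire)) {
            const RiskSnapshot s = state.snapshot();
            if (s.cap_used != 10 * s.position || s.last_trade_ns != s.position) {
                consistent = false;
            }
        }
    });
    for (int i = 1; i <= fills; ++i) state.on_fill(1, i, 10);
    done.store(true, std::memory_order_release);
    reader.join();
    return consistent && state.snapshot().position == fills;
}
```

Making each field a separate atomic would keep every individual read valid but would let a reader see a new `position` alongside an old `cap_used`, which is the composition problem described above.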

Matching Engine · avoid on hot path

On the hot path of a single-threaded order book, taking a mutex on every match adds ~110 ns each time (the cost of a mutex-guarded increment on M1 Pro). Single-thread ownership or message passing is just as correct and cheaper.

Example

An LMAX-style Disruptor ring buffer with one consumer thread per book sustains millions of ops/sec without ever acquiring a lock on the hot path.

Backend Service · good fit

For state mutated at human pace (configuration reloads, admin commands, infrequently touched caches), a mutex is the simplest correct choice. Its cost is overhead the latency budget doesn't notice, and a hand-rolled atomic protocol would add complexity for no measurable gain.

Example

A `RwLock<HashMap<...>>` around a feature-flag cache reloaded once per minute. Readers are common, writers are rare, contention is negligible.
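The `RwLock<HashMap<...>>` above is Rust notation. A rough C++17 analogue, swapped in here to match the page's other C++ snippets, is `std::shared_mutex` around `std::unordered_map`. `FeatureFlags`, `is_enabled`, and `reload` are hypothetical names, and the flag names in the usage note are made up.

```cpp
#include <mutex>
#include <shared_mutex>
#include <string>
#include <unordered_map>
#include <utility>

class FeatureFlags {
public:
    // Readers take a shared lock, so any number of them can run at once.
    bool is_enabled(const std::string& name) const {
        std::shared_lock<std::shared_mutex> lock(mu_);
        const auto it = flags_.find(name);
        return it != flags_.end() && it->second;  // unknown flags read as off
    }

    // The rare writer takes an exclusive lock and swaps in the fresh map.
    void reload(std::unordered_map<std::string, bool> fresh) {
        std::unique_lock<std::shared_mutex> lock(mu_);
        flags_ = std::move(fresh);
    }

private:
    mutable std::shared_mutex mu_;
    std::unordered_map<std::string, bool> flags_;
};
```

A reload thread would call something like `flags.reload({{"new_router", true}})` once per minute, while request handlers call `flags.is_enabled("new_router")`. Building the new map before taking the lock keeps the exclusive section down to a single move.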