Shared Counter · Compare
All three approaches side by side. Tradeoffs between correctness, latency, and where each one lands inside a real trading system.
| Approach | Correctness | Speed | Latency | p99 | Resource | Best use |
|---|---|---|---|---|---|---|
| Non-Atomic | incorrect | Very Fast | ns | n/a | low | Demo only. |
| Atomic | correct | Fast | ns → μs | stable | low | Counters, flags, sequence numbers. |
| Mutex | correct | Slower | μs → ms | can spike | medium | Multi-field critical sections, non-hot-path state. |
Performance summary
Non-Atomic (incorrect)
- Speed: Very Fast
- Latency: ns
- p99: n/a (incorrect)
- Resource: low
- Best use: Demo only.
Lowest overhead, but lost updates make it unusable.
Atomic (correct)
- Speed: Fast
- Latency: ns → μs
- p99: stable
- Resource: low
- Best use: Counters, flags, sequence numbers.
Around 3x faster than mutex under low contention; degrades as contention grows.
Mutex (correct)
- Speed: Slower
- Latency: μs → ms
- p99: can spike
- Resource: medium
- Best use: Multi-field critical sections, non-hot-path state.
Predictable under low contention; tail latency spikes under load.
Real-world impact
Where each approach lands in a real trading system. Hot-path fit says whether to reach for it on the critical path.
Order books rely on counters that must reach the same value every run. A non-atomic counter does not: on an Apple M1 Pro it loses 74% of writes under 4-thread contention.
An order-sequence counter incremented without synchronization on every accept produces gaps and duplicates. Audit and replay both break.
A counter that drops 74% of increments under contention reports roughly a quarter (26%) of its true rate. The dashboard reads calm during the actual incident.
Request counters, drop counters, retry counters: anything mutated by more than one thread without synchronization undercounts in the same way.
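The lost update behind those numbers can be replayed by hand. The sketch below is a single-threaded illustration, not the benchmark itself: it steps through one interleaving of two unsynchronized `counter = counter + 1` operations. `replay_lost_update` and `surviving_fraction` are hypothetical helper names introduced here.

```cpp
#include <cassert>
#include <cstdint>

// Replays, in one thread, the interleaving that loses an update when two
// threads run `counter = counter + 1` without synchronization.
// (A real unsynchronized version is a data race and undefined behavior in
// C++, so the steps are written out sequentially instead.)
inline std::uint64_t replay_lost_update() {
    std::uint64_t counter = 0;

    std::uint64_t a = counter;  // thread A loads 0
    std::uint64_t b = counter;  // thread B loads 0 before A has stored
    counter = a + 1;            // thread A stores 1
    counter = b + 1;            // thread B stores 1, overwriting A's update

    return counter;             // 1, although two increments ran
}

// Fraction of increments that survive, given the fraction that is lost.
inline double surviving_fraction(double lost) { return 1.0 - lost; }
```

With the 74% loss quoted above, `surviving_fraction(0.74)` comes to about 0.26, which is the "roughly a quarter" a dashboard would report.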
An SPSC ring buffer's head and tail are atomics — store with release on the producer side, load with acquire on the consumer. The Disruptor design is the canonical reference.
`head.store(next, release)` on publish; `head.load(acquire)` on consume. Relaxed ordering on the counter, ordered handoff on the cursor.
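A minimal sketch of that handoff, assuming a power-of-two capacity and exactly one producer thread and one consumer thread. `SpscRing` and `spsc_demo_sum` are illustrative names, not taken from the Disruptor or any other library, and the 64-byte alignment is an assumption about cache-line size.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>

template <typename T, std::size_t N>
class SpscRing {
    static_assert(N != 0 && (N & (N - 1)) == 0, "N must be a power of two");
public:
    // Producer thread only.
    bool try_push(const T& v) {
        const std::size_t h = head_.load(std::memory_order_relaxed);
        if (h - tail_.load(std::memory_order_acquire) == N) return false;  // full
        slots_[h & (N - 1)] = v;
        head_.store(h + 1, std::memory_order_release);  // publish the slot
        return true;
    }
    // Consumer thread only.
    bool try_pop(T& out) {
        const std::size_t t = tail_.load(std::memory_order_relaxed);
        if (t == head_.load(std::memory_order_acquire)) return false;  // empty
        out = slots_[t & (N - 1)];
        tail_.store(t + 1, std::memory_order_release);  // hand the slot back
        return true;
    }
private:
    std::array<T, N> slots_{};
    alignas(64) std::atomic<std::size_t> head_{0};  // written by the producer
    alignas(64) std::atomic<std::size_t> tail_{0};  // written by the consumer
};

// Pushes 1..count from the calling thread and sums them on a consumer thread.
inline std::uint64_t spsc_demo_sum(std::uint64_t count) {
    SpscRing<std::uint64_t, 1024> ring;
    std::uint64_t sum = 0;
    std::thread consumer([&] {
        std::uint64_t v = 0;
        for (std::uint64_t received = 0; received < count;) {
            if (ring.try_pop(v)) { sum += v; ++received; }
        }
    });
    for (std::uint64_t i = 1; i <= count; ++i) {
        while (!ring.try_push(i)) { /* spin until a slot frees up */ }
    }
    consumer.join();  // join makes `sum` safe to read here
    return sum;
}
```

The release store on `head_` pairs with the consumer's acquire load, so the slot write is visible before the consumer sees the new cursor. Each side reads its own cursor with relaxed ordering because no other thread writes it.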
High-frequency counters use `fetch_add(1, relaxed)`. The cost is ~50 ns per increment on M1 Pro — fine for anything under ~10 M ops/s per core.
Packets received, messages published, rejects, acknowledgements — one counter per thread aggregated periodically beats a single contended counter past a few hundred thousand ops/s.
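As a rough check on the figures above, ~50 ns per increment puts a ceiling near 20 M increments/s on a core that does nothing else. The per-thread pattern might look like the sketch below. `ShardedCounter`, `PaddedCounter`, and `sharded_demo` are hypothetical names, and the 128-byte padding is an assumption about the cache-line size on Apple silicon.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// One padded slot per thread so increments never share a cache line.
// 128 bytes is an assumption; many x86 parts use 64.
struct alignas(128) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
};

template <std::size_t Shards>
class ShardedCounter {
public:
    // Each thread passes its own shard index, so the add is uncontended.
    void add(std::size_t shard, std::uint64_t n = 1) {
        shards_[shard].value.fetch_add(n, std::memory_order_relaxed);
    }
    // Periodic aggregation: a relaxed sweep gives an approximate live total
    // and an exact one once the writers have been joined.
    std::uint64_t total() const {
        std::uint64_t sum = 0;
        for (const auto& s : shards_) sum += s.value.load(std::memory_order_relaxed);
        return sum;
    }
private:
    std::array<PaddedCounter, Shards> shards_{};
};

// Four threads, each incrementing its own shard `per_thread` times.
inline std::uint64_t sharded_demo(std::uint64_t per_thread) {
    constexpr std::size_t kThreads = 4;
    ShardedCounter<kThreads> counter;
    std::vector<std::thread> pool;
    for (std::size_t t = 0; t < kThreads; ++t) {
        pool.emplace_back([&counter, t, per_thread] {
            for (std::uint64_t i = 0; i < per_thread; ++i) counter.add(t);
        });
    }
    for (auto& th : pool) th.join();
    return counter.total();
}
```

Relaxed ordering is enough here because the counter value orders nothing else; only atomicity of each increment matters.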
A single-writer atomic `bool is_paused` lets the hot path branch in a few cycles. Compare-and-swap is reserved for the rare state transition.
Feed handler sets `is_paused.store(true, release)`. The order sender reads `is_paused.load(acquire)` once per loop iteration. Zero locks on the hot path.
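A small sketch of the flag, assuming one writer and any number of readers. `PauseFlag` and `send_orders` are hypothetical names; the loop stands in for an order sender and only counts what it would have sent.

```cpp
#include <atomic>
#include <cassert>

struct PauseFlag {
    std::atomic<bool> is_paused{false};

    // Single writer, for example the feed handler.
    void pause()  { is_paused.store(true, std::memory_order_release); }
    void resume() { is_paused.store(false, std::memory_order_release); }

    // Hot-path check: one atomic load, no lock.
    bool paused() const { return is_paused.load(std::memory_order_acquire); }
};

// Stand-in send loop: checks the flag once per iteration and stops as soon
// as it observes a pause. Returns how many orders it would have sent.
inline int send_orders(const PauseFlag& flag, int orders) {
    int sent = 0;
    for (int i = 0; i < orders; ++i) {
        if (flag.paused()) break;
        ++sent;
    }
    return sent;
}
```

Because only one thread ever writes the flag, a plain store is sufficient. Compare-and-swap would only be needed if several threads could race to change the state.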
A mutex protects a multi-field state update that must move together — position + last-trade timestamp + exposure cap. Atomics don't compose across fields.
`std::lock_guard<std::mutex> g(state_mu); position += qty; last_trade = now; cap_used += notional;` Any thread that takes `state_mu` sees all three writes or none of them.
On the hot path of an order book owned by a single thread, taking a mutex on every match adds ~110 ns per match on M1 Pro (the measured cost of one mutex-guarded increment). Single-thread ownership or message passing gives the same correctness at lower cost.
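The multi-field update can be sketched as a small class. `RiskState`, `RiskSnapshot`, and `fills_stay_consistent` are hypothetical names. The fixed notional of 10 per unit is an assumption chosen so a reader can check that `cap_used` and `position` always move together.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

struct RiskSnapshot {
    std::int64_t position;
    std::int64_t cap_used;
    std::chrono::steady_clock::time_point last_trade;
};

class RiskState {
public:
    // All three fields change under one lock, so they move together.
    void on_fill(std::int64_t qty, std::int64_t notional) {
        std::lock_guard<std::mutex> g(state_mu_);
        position_ += qty;
        last_trade_ = std::chrono::steady_clock::now();
        cap_used_ += notional;
    }
    // Readers take the same lock, so they never see a half-applied fill.
    RiskSnapshot snapshot() const {
        std::lock_guard<std::mutex> g(state_mu_);
        return RiskSnapshot{position_, cap_used_, last_trade_};
    }
private:
    mutable std::mutex state_mu_;
    std::int64_t position_ = 0;
    std::int64_t cap_used_ = 0;
    std::chrono::steady_clock::time_point last_trade_{};
};

// Writers apply fills of qty 1 and notional 10 while a reader checks that
// every snapshot satisfies cap_used == 10 * position.
inline bool fills_stay_consistent(int writers, int fills_per_writer) {
    RiskState state;
    std::atomic<bool> done{false};
    bool consistent = true;  // written only by the reader, read after join

    std::thread reader([&] {
        while (!done.load(std::memory_order_acquire)) {
            const RiskSnapshot s = state.snapshot();
            if (s.cap_used != 10 * s.position) consistent = false;
        }
    });

    std::vector<std::thread> pool;
    for (int w = 0; w < writers; ++w) {
        pool.emplace_back([&state, fills_per_writer] {
            for (int i = 0; i < fills_per_writer; ++i) state.on_fill(1, 10);
        });
    }
    for (auto& t : pool) t.join();
    done.store(true, std::memory_order_release);
    reader.join();

    const RiskSnapshot end = state.snapshot();
    return consistent &&
           end.position == static_cast<std::int64_t>(writers) * fills_per_writer;
}
```

Three separate atomics could not give the same guarantee: a reader could observe the new `position` alongside the old `cap_used`.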
An LMAX-style Disruptor ring buffer with one consumer thread per book sustains millions of ops/sec without ever acquiring a lock on the hot path.
For state mutated at human pace (configuration reloads, admin commands, infrequently touched caches), a mutex is the simplest correct choice. A lock-free protocol adds complexity to save latency that nobody will notice at that rate.
A `RwLock<HashMap<...>>` around a feature-flag cache reloaded once per minute. Readers are common, writers are rare, contention is negligible.
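In C++ the closest standard analog to that reader-writer lock is `std::shared_mutex` (C++17). The sketch below assumes a flag map of name to bool. `FeatureFlags`, `enabled`, and `reload` are hypothetical names, and the once-per-minute reload is left to the caller.

```cpp
#include <cassert>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <unordered_map>
#include <utility>

class FeatureFlags {
public:
    // Many readers may hold the shared lock at the same time.
    bool enabled(const std::string& name) const {
        std::shared_lock<std::shared_mutex> lock(mu_);
        const auto it = flags_.find(name);
        return it != flags_.end() && it->second;
    }
    // The rare writer takes the exclusive lock and swaps in a fresh map.
    void reload(std::unordered_map<std::string, bool> fresh) {
        std::unique_lock<std::shared_mutex> lock(mu_);
        flags_ = std::move(fresh);
    }
private:
    mutable std::shared_mutex mu_;
    std::unordered_map<std::string, bool> flags_;
};
```

Unknown flags read as disabled in this sketch, which is a design choice made here rather than something the original specifies.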