ST
StateTrace
Visual Quant & Low-Latency Systems Lab
GitHub
Concepts/Synchronization/Shared Counter/Deep Dive

Shared Counter · Deep Dive

Full code, memory ordering, pitfalls, tradeoff analysis. The walkthrough shows what happens; this page explains why.

Overview

The shared counter is the smallest possible concurrent program: two threads each add 1 to the same integer. Its appeal as a teaching tool is that the bug is almost invisible in source form (one line of arithmetic) but completely visible in the trace.

Everything that comes later — atomics, memory ordering, lock-free data structures, RCU, sharded counters — exists because programs more interesting than this one still need to coordinate around shared state. Understand this and the rest is just engineering against the same forces.

Non-Atomic
1int counter = 0;
2
3void worker() {
4    int local = counter;  // read
5    local = local + 1;    // compute
6    counter = local;      // write
7}
Memory ordering

There is no memory ordering here, and that is the entire problem. Each thread sees the program as three steps — read, compute, write — but the hardware and compiler are free to reorder, batch, or interleave those steps across threads in any way that preserves single-thread semantics.

On x86 the writes are at least eventually consistent due to the strong hardware memory model. On ARM, AArch64, and POWER, the writes can be reordered even further. None of that matters for correctness here, because the bug shows up at the instruction-stream level long before the memory model gets a chance to ruin anything.

Adding volatile does not fix this. volatile only forbids the compiler from optimising the load and store away — it does not make the read-modify-write indivisible. This is a common interview question, and the answer is "no."

Pitfalls
  • Adding volatile does not make the operation atomic — it only blocks the compiler from caching the value in a register.

  • Mutex-less optimization in a single-writer/multi-reader pattern needs std::atomic on the writer side too — otherwise the read can tear or see torn writes.

  • On most ABIs an aligned 64-bit load/store is atomic at the hardware level, but the standard does not guarantee it — relying on this is non-portable.

  • A debug build may serialise the threads enough that the bug disappears. Run benchmarks in release mode with optimisation.

Benchmark methodology

The benchmark in benchmarks/cpp/shared-counter/non_atomic.cpp runs each thread N = 1,000,000 increments and reports the final counter value alongside the throughput. We expect the result to be wrong — the test asserts final_count < 2 * N to confirm that lost updates actually happen on this machine. If you ever see final_count == 2 * N on a strongly-ordered architecture, raise N or increase thread count until the race shows up.

Measured results

Numbers and chart for this approach appear in the Measured throughput section below.

Atomic
1#include <atomic>
2
3std::atomic<int64_t> counter{0};
4
5void worker() {
6    counter.fetch_add(1, std::memory_order_relaxed);
7}
Memory ordering

std::atomic<int64_t>::fetch_add issues a single read-modify-write instruction the hardware guarantees as indivisible — on x86 this is LOCK XADD, on AArch64 it is either LDADD (from ARMv8.1-A) or an LDXR/STXR retry loop.

The memory_order argument controls what other memory operations can move across this one:

  • memory_order_relaxed — the RMW is still atomic, but unrelated loads and stores in this thread are free to reorder around it. Right for pure counters where you do not care when other threads "see" the new value.
  • memory_order_acquire / release — pair an atomic on the consumer with one on the producer to publish a piece of memory (e.g. a sequence number guarding access to a buffer).
  • memory_order_seq_cst — the default. Slowest but easiest to reason about. Insertion of a global ordering across all sequentially-consistent atomic operations in the program.

For a pure counter you almost always want relaxed. Choosing seq_cst by default is one of the most common micro-pessimisations in low-latency code.

Pitfalls
  • Mixing relaxed atomics with non-atomic reads of related data is unsafe — you need acquire/release if other state must be visible together.

  • fetch_add on a tiny counter is great; compare_exchange_strong loops on a contended cache line can be slower than a mutex because every retry causes coherence traffic.

  • On x86 every atomic RMW is effectively seq_cst at the hardware level, so the C++ memory order only affects compiler reorderings. ARM behaves differently — your benchmarks need to reflect the target platform.

  • std::atomic<T>::is_lock_free() should be true for int32_t / int64_t on every mainstream platform. If it isn't, the standard library is faking atomicity with a mutex underneath.

Benchmark methodology

The benchmark runs fetch_add(1, std::memory_order_relaxed) in a tight loop across N threads. We report ops/sec and a per-op latency derived from benchmark::State::iterations(). Under low contention (1–2 threads) atomic is the throughput leader by a wide margin over mutex. Under heavy contention (8+ threads on the same cache line) throughput collapses for both — this is not a problem with atomics specifically, it is the cache-line ping-pong that False Sharing (next concept) will explain.

Measured results

Numbers and chart for this approach appear in the Measured throughput section below.

Mutex
1#include <mutex>
2
3int64_t counter = 0;
4std::mutex m;
5
6void worker() {
7    std::lock_guard<std::mutex> lock(m);
8    counter = counter + 1;
9}
Memory ordering

A mutex acquire is a full memory barrier on every architecture. Anything the unlocking thread wrote before it released the lock is visible to the next thread that acquires it. This is why mutex-protected code can use plain non-atomic loads and stores: the lock acquire/release acts as the acquire/release pair.

std::mutex on macOS and Linux uses pthread_mutex_t, which is backed by a futex on Linux. Uncontended lock + unlock is around 25 ns on modern hardware. Contended lock involves a kernel-level wait queue and can cost several microseconds, with tail-latency spikes if the kernel decides to deschedule the waiting thread.

std::lock_guard is the right wrapper for most cases — std::unique_lock adds flexibility (deferred locking, timed locks, transferring ownership) at the cost of slightly more code. std::scoped_lock (C++17) is the multi-mutex version and avoids deadlock by ordering acquires.

Pitfalls
  • Forgetting lock_guard and writing m.lock(); ...; m.unlock() by hand — an early return or exception leaks the lock.

  • Recursive locking — std::mutex is not reentrant. Reentrant locking patterns usually point to a design problem; if you really need it, use std::recursive_mutex.

  • Holding a mutex across an I/O call, allocation, or blocking primitive — the lock waits for the slow operation and every other thread waits for the lock.

  • Priority inversion in real-time systems: a low-priority holder blocks a high-priority waiter. PI mutexes exist but are rare in userland code.

  • Sleeping on a contended mutex involves a syscall. Spinning briefly first via std::atomic_flag can win for very short critical sections — but tune carefully or you waste cycles.

Benchmark methodology

The mutex benchmark wraps the increment in a std::lock_guard<std::mutex>. We expect throughput to be 5–10x lower than the atomic version even under low contention because of the lock acquire/release overhead, with the gap widening as threads are added — past 4 threads the kernel wait queue dominates and tail latency spikes by orders of magnitude. The p99 column on the comparison table is where this hurts most.

Measured results

Numbers and chart for this approach appear in the Measured throughput section below.

Measured throughput

Apple M1 Pro · 10c · Darwin arm64 · 2026-05-13

4 threads each running 1,000,000 increments. Bars show the median of 10 repetitions. Dashed bars are non-atomic — fast, but the final counter is wrong, so the number is misleading by design.

ApproachLangThroughput (median)Per-opFinal counterCorrect
Non-AtomicC++1597.5 Mops/s2.5 ns1,026,290(-74%)incorrect
AtomicC++75.6 Mops/s52.9 ns4,000,000correct
MutexC++35.6 Mops/s112.4 ns4,000,000correct
Non-AtomicRust2283.6 Mops/s1.8 ns1,024,559(-74%)incorrect
AtomicRust86.4 Mops/s46.3 ns4,000,000correct
MutexRust38.3 Mops/s104.5 ns4,000,000correct

Tradeoffs

The three approaches are not really competing — they answer different questions.

  • Non-atomic is a learning artefact. It exists so you can see what goes wrong when synchronisation is missing. Do not ship it.
  • Atomic is the right choice when the shared state is a single primitive (counter, flag, sequence number, pointer) and the operation fits one of the standard atomic primitives. Hot-path-friendly.
  • Mutex is the right choice when the critical section touches multiple fields that must move together, when the work inside is non-trivial, or when correctness clarity matters more than nanosecond-level latency.

The most common mistake is reaching for atomics on data structures that are too complex for a single atomic op (linked lists, hash tables, ring buffers with multi-field state). That path leads to compare-exchange retry loops, the ABA problem, and lock-free data-structure work that takes weeks to get right. Use a mutex unless you have a reason not to — then if you do have a reason (you measured contention dominating your latency budget), reach for the right tool: per-thread shards, RCU, single-ownership + message passing, or a well-studied lock-free design.

There is also a fourth option not on this page: avoid sharing. Many problems that look like they need synchronisation are better solved by giving each thread its own copy and merging at a coarser cadence. This is the design philosophy behind LMAX Disruptor, Seastar, and most modern exchange matching engines.

References