Shared Counter · Deep Dive
Full code, memory ordering, pitfalls, tradeoff analysis. The walkthrough shows what happens; this page explains why.
Overview
The shared counter is the smallest possible concurrent program: two threads each add 1 to the same integer. Its appeal as a teaching tool is that the bug is almost invisible in source form (one line of arithmetic) but completely visible in the trace.
Everything that comes later — atomics, memory ordering, lock-free data structures, RCU, sharded counters — exists because programs more interesting than this one still need to coordinate around shared state. Understand this and the rest is just engineering against the same forces.
Non-Atomic
View walkthrough ›1int counter = 0;
2
3void worker() {
4 int local = counter; // read
5 local = local + 1; // compute
6 counter = local; // write
7}There is no memory ordering here, and that is the entire problem. Each thread sees the program as three steps — read, compute, write — but the hardware and compiler are free to reorder, batch, or interleave those steps across threads in any way that preserves single-thread semantics.
On x86 the writes are at least eventually consistent due to the strong hardware memory model. On ARM, AArch64, and POWER, the writes can be reordered even further. None of that matters for correctness here, because the bug shows up at the instruction-stream level long before the memory model gets a chance to ruin anything.
Adding volatile does not fix this. volatile only forbids the compiler
from optimising the load and store away — it does not make the
read-modify-write indivisible. This is a common interview question, and the
answer is "no."
Adding
volatiledoes not make the operation atomic — it only blocks the compiler from caching the value in a register.Mutex-less optimization in a single-writer/multi-reader pattern needs
std::atomicon the writer side too — otherwise the read can tear or see torn writes.On most ABIs an aligned 64-bit load/store is atomic at the hardware level, but the standard does not guarantee it — relying on this is non-portable.
A debug build may serialise the threads enough that the bug disappears. Run benchmarks in release mode with optimisation.
The benchmark in benchmarks/cpp/shared-counter/non_atomic.cpp runs each
thread N = 1,000,000 increments and reports the final counter value alongside
the throughput. We expect the result to be wrong — the test asserts
final_count < 2 * N to confirm that lost updates actually happen on this
machine. If you ever see final_count == 2 * N on a strongly-ordered
architecture, raise N or increase thread count until the race shows up.
Numbers and chart for this approach appear in the Measured throughput section below.
Atomic
View walkthrough ›1#include <atomic>
2
3std::atomic<int64_t> counter{0};
4
5void worker() {
6 counter.fetch_add(1, std::memory_order_relaxed);
7}std::atomic<int64_t>::fetch_add issues a single read-modify-write
instruction the hardware guarantees as indivisible — on x86 this is
LOCK XADD, on AArch64 it is either LDADD (from ARMv8.1-A) or an
LDXR/STXR retry loop.
The memory_order argument controls what other memory operations can move across this one:
memory_order_relaxed— the RMW is still atomic, but unrelated loads and stores in this thread are free to reorder around it. Right for pure counters where you do not care when other threads "see" the new value.memory_order_acquire/release— pair an atomic on the consumer with one on the producer to publish a piece of memory (e.g. a sequence number guarding access to a buffer).memory_order_seq_cst— the default. Slowest but easiest to reason about. Insertion of a global ordering across all sequentially-consistent atomic operations in the program.
For a pure counter you almost always want relaxed. Choosing seq_cst by
default is one of the most common micro-pessimisations in low-latency code.
Mixing relaxed atomics with non-atomic reads of related data is unsafe — you need acquire/release if other state must be visible together.
fetch_addon a tiny counter is great;compare_exchange_strongloops on a contended cache line can be slower than a mutex because every retry causes coherence traffic.On x86 every atomic RMW is effectively seq_cst at the hardware level, so the C++ memory order only affects compiler reorderings. ARM behaves differently — your benchmarks need to reflect the target platform.
std::atomic<T>::is_lock_free()should be true forint32_t/int64_ton every mainstream platform. If it isn't, the standard library is faking atomicity with a mutex underneath.
The benchmark runs fetch_add(1, std::memory_order_relaxed) in a tight loop
across N threads. We report ops/sec and a per-op latency derived from
benchmark::State::iterations(). Under low contention (1–2 threads) atomic is
the throughput leader by a wide margin over mutex. Under heavy contention
(8+ threads on the same cache line) throughput collapses for both — this is
not a problem with atomics specifically, it is the cache-line ping-pong that
False Sharing (next concept) will explain.
Numbers and chart for this approach appear in the Measured throughput section below.
Mutex
View walkthrough ›1#include <mutex>
2
3int64_t counter = 0;
4std::mutex m;
5
6void worker() {
7 std::lock_guard<std::mutex> lock(m);
8 counter = counter + 1;
9}A mutex acquire is a full memory barrier on every architecture. Anything the unlocking thread wrote before it released the lock is visible to the next thread that acquires it. This is why mutex-protected code can use plain non-atomic loads and stores: the lock acquire/release acts as the acquire/release pair.
std::mutex on macOS and Linux uses pthread_mutex_t, which is
backed by a futex on Linux. Uncontended lock + unlock is around 25 ns on
modern hardware. Contended lock involves a kernel-level wait queue and
can cost several microseconds, with tail-latency spikes if the kernel
decides to deschedule the waiting thread.
std::lock_guard is the right wrapper for most cases — std::unique_lock
adds flexibility (deferred locking, timed locks, transferring ownership)
at the cost of slightly more code. std::scoped_lock (C++17) is the
multi-mutex version and avoids deadlock by ordering acquires.
Forgetting
lock_guardand writingm.lock(); ...; m.unlock()by hand — an early return or exception leaks the lock.Recursive locking —
std::mutexis not reentrant. Reentrant locking patterns usually point to a design problem; if you really need it, usestd::recursive_mutex.Holding a mutex across an I/O call, allocation, or blocking primitive — the lock waits for the slow operation and every other thread waits for the lock.
Priority inversion in real-time systems: a low-priority holder blocks a high-priority waiter. PI mutexes exist but are rare in userland code.
Sleeping on a contended mutex involves a syscall. Spinning briefly first via
std::atomic_flagcan win for very short critical sections — but tune carefully or you waste cycles.
The mutex benchmark wraps the increment in a std::lock_guard<std::mutex>.
We expect throughput to be 5–10x lower than the atomic version even under
low contention because of the lock acquire/release overhead, with the gap
widening as threads are added — past 4 threads the kernel wait queue
dominates and tail latency spikes by orders of magnitude. The p99 column
on the comparison table is where this hurts most.
Numbers and chart for this approach appear in the Measured throughput section below.
Measured throughput
4 threads each running 1,000,000 increments. Bars show the median of 10 repetitions. Dashed bars are non-atomic — fast, but the final counter is wrong, so the number is misleading by design.
| Approach | Lang | Throughput (median) | Per-op | Final counter | Correct |
|---|---|---|---|---|---|
| Non-Atomic | C++ | 1597.5 Mops/s | 2.5 ns | 1,026,290(-74%) | incorrect |
| Atomic | C++ | 75.6 Mops/s | 52.9 ns | 4,000,000 | correct |
| Mutex | C++ | 35.6 Mops/s | 112.4 ns | 4,000,000 | correct |
| Non-Atomic | Rust | 2283.6 Mops/s | 1.8 ns | 1,024,559(-74%) | incorrect |
| Atomic | Rust | 86.4 Mops/s | 46.3 ns | 4,000,000 | correct |
| Mutex | Rust | 38.3 Mops/s | 104.5 ns | 4,000,000 | correct |
Tradeoffs
The three approaches are not really competing — they answer different questions.
- Non-atomic is a learning artefact. It exists so you can see what goes wrong when synchronisation is missing. Do not ship it.
- Atomic is the right choice when the shared state is a single primitive (counter, flag, sequence number, pointer) and the operation fits one of the standard atomic primitives. Hot-path-friendly.
- Mutex is the right choice when the critical section touches multiple fields that must move together, when the work inside is non-trivial, or when correctness clarity matters more than nanosecond-level latency.
The most common mistake is reaching for atomics on data structures that are too complex for a single atomic op (linked lists, hash tables, ring buffers with multi-field state). That path leads to compare-exchange retry loops, the ABA problem, and lock-free data-structure work that takes weeks to get right. Use a mutex unless you have a reason not to — then if you do have a reason (you measured contention dominating your latency budget), reach for the right tool: per-thread shards, RCU, single-ownership + message passing, or a well-studied lock-free design.
There is also a fourth option not on this page: avoid sharing. Many problems that look like they need synchronisation are better solved by giving each thread its own copy and merging at a coarser cadence. This is the design philosophy behind LMAX Disruptor, Seastar, and most modern exchange matching engines.