ST
StateTrace
Visual Quant & Low-Latency Systems Lab
GitHub
Curriculum/mutex

Mutex

synchronization·L1 · combinator
Replacesthe belief that mutexes are inherently slow.

Uncontended acquire is around 25 ns on modern hardware; contended p99 acquire can reach milliseconds when the kernel deschedules the holder. The slowness is the contention tail, not the primitive itself. A mutex serialises access to shared state via an acquire–release pair — a combinator over atomic operations and the OS scheduler's wait queue.

What a mutex is

A mutex gives exactly one thread at a time the right to execute a critical section — a region of code that touches shared mutable state. Acquire takes the right; release returns it. Between acquire and release, no other thread can hold the same mutex, so the section is effectively single-threaded. The atomic operation underneath is a compare-and-swap on the mutex's internal state; the scheduler intervenes only when a thread cannot acquire because the lock is held.

The simplest use

Same setup as Shared Counter: two threads, one int64_t, one +1 each, one million iterations. Apple M1 Pro, 4 threads: the non-atomic version loses 74% of writes. Wrap the increment in std::lock_guard, and both threads serialise on the lock — every increment lands. The trace below walks the mutex version step by step.

Serialised, observed

Walk the mutex trace. T1's increment fully completes — read, compute, write — before T2's acquire returns. The critical section runs to completion before the next thread enters. Correctness is identical to the atomic version; throughput is lower because each acquire-release pair is more expensive than a single `LOCK XADD`.

traceshared-counter/mutexOpen walkthrough

Acquire pairs with release as a memory fence

Every mutex acquire is at minimum an acquire fence; every release is at minimum a release fence. Together, the acquire–release pair guarantees that any write made by the previous holder before release is visible to the next acquirer.

This is why code inside a mutex can use plain non-atomic loads and stores — the pair acts as the fence around the entire critical section. On x86 (Total Store Order), atomic operations are effectively full barriers and the cost is paid up front; on ARM and POWER, the fences are one-way and the cost is correspondingly lower.

The trade is throughput for reasoning simplicity. One synchronisation guarantee covers a section, not an access.

What contention looks like

Uncontended lock + unlock is around 25 ns on modern hardware. The atomic CAS succeeds in userspace on the first try — Linux's futex is fast userspace mutex, with no syscall when uncontended; macOS's os_unfair_lock is similar. The scheduler is not involved.

Contended lock is a different cost regime. The failing thread is descheduled, the holder eventually releases, the kernel wakes a waiter via FUTEX_WAIT / FUTEX_WAKE (Linux) or the equivalent. Round-trip on contention is several microseconds — two orders of magnitude more than uncontended. Under heavy contention (8+ threads on the same lock), tail latency dominates: p99 acquire time can reach milliseconds when the kernel deschedules the holder mid-section. The Shared Counter benchmark exposes this regime when thread count rises.

Canonical usage — std::mutex + std::lock_guard

cpp
#include <mutex>
#include <thread>
#include <cstdint>

std::mutex m;
int64_t counter = 0;

void increment(int n) {
    for (int i = 0; i < n; ++i) {
        std::lock_guard<std::mutex> guard(m);   // acquire on construct
        counter += 1;                            // critical section
    }                                            // release on destruct (RAII)
}

// Two threads, one million increments each.
// Final counter == 2,000,000 exactly. Throughput is 5–10× lower than the
// atomic_fetch_add version because every increment now pays for a lock.

The mutex zoo

Six mutex variants from the C++ standard library, each for a different shape of critical section.

  • std::mutex (C++11) — Default. Binary, non-recursive.
  • std::lock_guard (C++11) — RAII wrapper. Acquire on construct, release on destruct. The canonical usage.
  • std::unique_lock (C++11) — Flexible wrapper. Deferred locking, timed locks, transferable ownership. Useful when the critical section needs to unlock-and-relock or pass ownership across scopes.
  • std::scoped_lock (C++17) — Multi-mutex wrapper. Acquires N mutexes using a deadlock-avoiding algorithm (typically try-lock with backoff and retry in libstdc++/libc++). The right shape when a critical section spans multiple mutexes.
  • std::recursive_mutex (C++11) — Reentrant. The same thread can acquire it multiple times. Recursive patterns usually indicate a design that could be flattened.
  • std::shared_mutex (C++17) — Many-readers / one-writer. Useful when reads dominate writes by a wide margin and the critical section is non-trivial.

The deadlock pattern (lab input)

cpp
// Two threads, two mutexes, opposite acquire order.
std::mutex m1, m2;

void thread_a() {
    std::lock_guard<std::mutex> g1(m1);  // acquire m1
    std::lock_guard<std::mutex> g2(m2);  // then m2
    /* work */
}

void thread_b() {
    std::lock_guard<std::mutex> g2(m2);  // acquire m2
    std::lock_guard<std::mutex> g1(m1);  // then m1
    /* work */
}

Find the deadlock

Look at the deadlock-pattern code above. Answer in writing: (a) the exact interleaving that causes the deadlock; (b) the smallest code change in thread_b that prevents it; (c) why std::scoped_lock(m1, m2) in both threads would prevent it; (d) which of the three fix-shapes from the Race Condition card (serialize / immutable / own-don't-share) applies to deadlock avoidance.

Expected: (a) Thread A acquires m1, thread B acquires m2, thread A blocks waiting on m2 (held by B), thread B blocks waiting on m1 (held by A). Both blocked, neither releases. (b) Reorder thread B's acquires to `m1` then `m2` — same order in both threads. (c) `std::scoped_lock` acquires multiple mutexes using a deadlock-avoiding algorithm — typically try-lock with backoff and retry in libstdc++/libc++. The C++ standard mandates the deadlock-avoidance property, not the specific algorithm. (d) Serialize at a coarser scope — the multi-mutex acquisition itself becomes atomic via either a canonical ordering or a single combining lock.

Bridges
  • scheduler-wait-queueshared mechanism
    Mutex acquire under contention is a scheduler operation, not just a memory operation — the cost shows up in the OS scheduler's wait queue, not in cache traffic. The tail-latency behaviour is the scheduler's, not the lock's.
    Where this shows up
    • Linux CFS parks a contender on the wait queue after the spin budget
    • Tail latency p99 dominated by the wake-up path, not the critical section
    • `perf sched latency` attributes the delay to runqueue residency
  • atomic-rmw-primitivesmodel → implementation
    Atomic read-modify-write operations implement smaller critical sections without a general mutex — same serialise-shape fix, narrower scope, no scheduler involvement when uncontended.
    Where this shows up
    • x86 `lock xadd` for `fetch_add`; ARM uses `ldxr/stxr` retry loop
    • Single instruction guarantees the read-modify-write is indivisible
    • Cost is the cache-line lock, not the instruction count
Done state

Evidence the learner produces, checks that confirm it.

Evidence
  • artifactBenchmark output from `pnpm bench:cpp` showing the mutex run reaches `final_count == 2 * N` and reports per-op latency 5–10× higher than the atomic run.
  • observable behaviorCan articulate the difference between uncontended (~25 ns) and contended (~µs) mutex acquire, and explain why tail latency dominates the contended case (kernel wait queue + descheduling).
Checks
  • command · exit 0pnpm bench:cppexpects output to include: mutex
  • trace · shared-counter/mutexIdentifies the step at which T1's release pairs with T2's acquire — the moment T2 is guaranteed to see T1's write. This is the acquire-release fence in action.
  • manualGiven the deadlock-pattern code from the lab, identifies the interleaving that produces the deadlock and proposes the smallest fix (acquire order). Bonus: explains why `std::scoped_lock` works.
References