Skip to content
System design course
Ch.4 · Designing real systems·how to build it ·8 min read

Building a distributed key-value store

Implement the quorum read/write paths, the LSM-tree storage engine, version-vector conflict detection, and Merkle-tree anti-entropy repair.


The write path (quorum + hinted handoff)

A coordinator node sends the write to the N preference-list replicas and waits for W acks:

def put(key, value, ctx):                          # ctx = version vector
    replicas = ring.preference_list(key, N)
    version = ctx.increment(node_id)               # bump this node's counter
    record = Record(value, version, ts=now())
    acks = 0
    for r in replicas:
        if r.alive:
            if r.write(key, record): acks += 1
        else:
            hinted_handoff(r, key, record)         # stash hint on a standby
    if acks >= W: return OK
    raise WriteFailed()                            # < W acks

On each replica, the write first appends to a commit log (durability), then updates the in-memory memtable; the memtable flushes to an immutable SSTable when full.

The read path (quorum + read repair)

def get(key):
    replicas = ring.preference_list(key, N)
    responses = [r.read(key) for r in replicas[:R_plus] if r.alive]
    wait_for(R, responses)                         # need R responses
    newest = resolve(responses)                    # by version vector
    if divergent(responses):
        async_read_repair(key, newest, replicas)   # fix stale replicas in bg
    return newest

The LSM storage engine

Per-node storage tuned for writes:

  • Memtable — sorted in-memory structure; writes go here (fast).
  • Commit log — append-only WAL written before the memtable, so a crash replays.
  • SSTables — when the memtable fills, flush it to an immutable sorted file on disk. Reads check memtable → SSTables newest-first.
  • Bloom filter per SSTable — skip files that definitely lack the key (Chapter 2), so a read miss doesn’t scan every file.
  • Compaction — merge SSTables in the background to reclaim space and drop tombstones (deleted keys).

This is why these stores excel at writes: every write is a sequential append, never a random in-place update.

Conflict detection with version vectors

A version vector is a map node_id → counter. Compare two versions:

def compare(a, b):
    a_dom = all(a[n] >= b.get(n,0) for n in a) and a != b
    b_dom = all(b[n] >= a.get(n,0) for n in b) and a != b
    if a_dom: return "a_newer"
    if b_dom: return "b_newer"
    return "concurrent"        # conflict → merge or return siblings to the app

concurrent means neither causally precedes the other → a genuine conflict the application (or a CRDT merge) must reconcile. Last-write-wins instead picks the higher timestamp — simpler, but silently discards one write.

Anti-entropy with Merkle trees

Replicas drift (missed writes, hinted handoffs not yet delivered). Each replica builds a Merkle tree over its key ranges; two replicas compare root hashes, and only descend into subtrees that differ — so they find divergent keys with minimal data transfer, then sync just those. (Chapter 2’s checksums/Merkle idea.)

Membership and failure detection

Nodes run a gossip protocol to share membership and a phi-accrual-style failure detector (adaptive heartbeats, Chapter 2) so the ring stays current as nodes join, leave, or die — and hinted handoffs know where to deliver.

Failure handling

  • Replica down on write → hinted handoff keeps the write available; delivered on recovery.
  • Stale read → read repair + anti-entropy converge replicas.
  • Coordinator down → any node can coordinate; the client retries another.
  • Datacenter loss → replicas on other DCs serve; quorum tuned to survive it.

The takeaway

Concrete signals: quorum W/R/N writes and reads, commit-log + LSM storage with Bloom filters and compaction, version vectors for conflicts, hinted handoff + read repair + Merkle anti-entropy for availability and convergence, and gossip membership. This durable, leaderless store is the backbone many later systems shard their data onto.