Building a distributed key-value store
Implement the quorum read/write paths, the LSM-tree storage engine, version-vector conflict detection, and Merkle-tree anti-entropy repair.
The write path (quorum + hinted handoff)
A coordinator node sends the write to the N preference-list replicas and waits for W acks:
def put(key, value, ctx): # ctx = version vector
replicas = ring.preference_list(key, N)
version = ctx.increment(node_id) # bump this node's counter
record = Record(value, version, ts=now())
acks = 0
for r in replicas:
if r.alive:
if r.write(key, record): acks += 1
else:
hinted_handoff(r, key, record) # stash hint on a standby
if acks >= W: return OK
raise WriteFailed() # < W acks
On each replica, the write first appends to a commit log (durability), then updates the in-memory memtable; the memtable flushes to an immutable SSTable when full.
The read path (quorum + read repair)
def get(key):
replicas = ring.preference_list(key, N)
responses = [r.read(key) for r in replicas[:R_plus] if r.alive]
wait_for(R, responses) # need R responses
newest = resolve(responses) # by version vector
if divergent(responses):
async_read_repair(key, newest, replicas) # fix stale replicas in bg
return newest
The LSM storage engine
Per-node storage tuned for writes:
- Memtable — sorted in-memory structure; writes go here (fast).
- Commit log — append-only WAL written before the memtable, so a crash replays.
- SSTables — when the memtable fills, flush it to an immutable sorted file on disk. Reads check memtable → SSTables newest-first.
- Bloom filter per SSTable — skip files that definitely lack the key (Chapter 2), so a read miss doesn’t scan every file.
- Compaction — merge SSTables in the background to reclaim space and drop tombstones (deleted keys).
This is why these stores excel at writes: every write is a sequential append, never a random in-place update.
Conflict detection with version vectors
A version vector is a map node_id → counter. Compare two versions:
def compare(a, b):
a_dom = all(a[n] >= b.get(n,0) for n in a) and a != b
b_dom = all(b[n] >= a.get(n,0) for n in b) and a != b
if a_dom: return "a_newer"
if b_dom: return "b_newer"
return "concurrent" # conflict → merge or return siblings to the app
concurrent means neither causally precedes the other → a genuine conflict the
application (or a CRDT merge) must reconcile. Last-write-wins instead picks the
higher timestamp — simpler, but silently discards one write.
Anti-entropy with Merkle trees
Replicas drift (missed writes, hinted handoffs not yet delivered). Each replica builds a Merkle tree over its key ranges; two replicas compare root hashes, and only descend into subtrees that differ — so they find divergent keys with minimal data transfer, then sync just those. (Chapter 2’s checksums/Merkle idea.)
Membership and failure detection
Nodes run a gossip protocol to share membership and a phi-accrual-style failure detector (adaptive heartbeats, Chapter 2) so the ring stays current as nodes join, leave, or die — and hinted handoffs know where to deliver.
Failure handling
- Replica down on write → hinted handoff keeps the write available; delivered on recovery.
- Stale read → read repair + anti-entropy converge replicas.
- Coordinator down → any node can coordinate; the client retries another.
- Datacenter loss → replicas on other DCs serve; quorum tuned to survive it.
The takeaway
Concrete signals: quorum W/R/N writes and reads, commit-log + LSM storage with Bloom filters and compaction, version vectors for conflicts, hinted handoff + read repair + Merkle anti-entropy for availability and convergence, and gossip membership. This durable, leaderless store is the backbone many later systems shard their data onto.