Designing a distributed key-value store
Build a Dynamo-style durable key-value database — partitioning, replication with tunable quorums, and how it stays available during partitions.
The problem
Design a durable, distributed key-value store (think DynamoDB / Cassandra): a
database exposing get(key) / put(key, value) that scales to huge data and
write volume across a cluster, survives node and datacenter failures, and stays
available. Unlike the cache, this is the source of truth — data must persist.
Step 1 — Requirements
Functional: get(key), put(key, value), delete(key); data durably stored
and replicated; scales horizontally; runs across datacenters.
Non-functional: high availability (Dynamo’s defining goal — “always writeable”), horizontal scalability, partition tolerance, tunable consistency, and low latency. By CAP this is an AP system that leans on eventual consistency.
Step 2 — Partitioning
Spread keys across nodes with consistent hashing + virtual nodes so the cluster grows/shrinks with minimal data movement and load stays even (Chapter 2). Each key hashes to a position on the ring; the node clockwise owns it.
Step 3 — Replication
Each key is replicated to the N nodes clockwise from its position (the “preference list”), skipping virtual duplicates so replicas land on distinct physical nodes and, ideally, distinct racks/datacenters. This gives durability and availability: any replica can serve the key.
Step 4 — Tunable consistency with quorums
The heart of the design (Chapter 2’s quorum lesson):
- N = replicas, W = nodes that must ack a write, R = nodes read.
- W + R > N → strong consistency (reads see the latest write).
- W + R ≤ N → faster, more available, but possibly stale.
Let callers tune per operation: W=1 for fast writes, R=1 for fast reads, or
R=W=quorum for consistency. This is the AP/EL knob made concrete.
Step 5 — Handling failures (the interesting part)
- Hinted handoff — if a replica node is down during a write, another node temporarily holds a “hint” and delivers it when the node returns. Keeps writes always available during transient failures.
- Read repair — on a read, if replicas disagree, return the newest and asynchronously fix the stale ones.
- Anti-entropy (Merkle trees) — replicas periodically compare Merkle trees to find and repair divergence efficiently (Chapter 2’s checksums lesson).
Step 6 — Conflict resolution
With leaderless writes, concurrent updates to one key conflict. Resolve with:
- Version vectors / vector clocks — track causality; detect concurrent versions and either merge or return both (“siblings”) to the app.
- Last-write-wins — simplest (by timestamp), but can silently drop a write.
Storage engine
Each node stores data in an LSM-tree (memtable + SSTables on disk) — optimized for high write throughput via sequential appends, with a commit log for durability and Bloom filters to skip SSTables on reads that would miss.
Trade-offs to raise
- Consistency vs availability/latency — the whole R/W/N knob; default to AP + eventual, tighten where needed.
- LWW simplicity vs lost writes — version vectors are correct but complex.
- Write availability via hinted handoff vs the complexity of reconciliation.
The interview cue
“Consistent hashing to partition, N replicas on the preference list across racks, tunable R/W/N quorums for the consistency/latency trade, hinted handoff + read repair + Merkle anti-entropy for availability and convergence, version vectors for conflicts, and an LSM-tree engine for write throughput.” That’s the Dynamo paper in one breath; implementation of the write/read paths is next.