Designing a distributed key-value store

Build a Dynamo-style durable key-value database — partitioning, replication with tunable quorums, and how it stays available during partitions.

The problem

Design a durable, distributed key-value store (think DynamoDB / Cassandra): a database exposing get(key) / put(key, value) that scales to huge data and write volume across a cluster, survives node and datacenter failures, and stays available. Unlike the cache, this is the source of truth — data must persist.

Step 1 — Requirements

Functional: get(key), put(key, value), delete(key); data durably stored and replicated; scales horizontally; runs across datacenters.

Non-functional: high availability (Dynamo’s defining goal — “always writeable”), horizontal scalability, partition tolerance, tunable consistency, and low latency. By CAP this is an AP system that leans on eventual consistency.

Step 2 — Partitioning

Spread keys across nodes with consistent hashing + virtual nodes so the cluster grows/shrinks with minimal data movement and load stays even (Chapter 2). Each key hashes to a position on the ring; the node clockwise owns it.

Step 3 — Replication

Each key is replicated to the N nodes clockwise from its position (the “preference list”), skipping virtual duplicates so replicas land on distinct physical nodes and, ideally, distinct racks/datacenters. This gives durability and availability: any replica can serve the key.

Step 4 — Tunable consistency with quorums

The heart of the design (Chapter 2’s quorum lesson):

N = replicas, W = nodes that must ack a write, R = nodes read.
W + R > N → strong consistency (reads see the latest write).
W + R ≤ N → faster, more available, but possibly stale.

Let callers tune per operation: W=1 for fast writes, R=1 for fast reads, or R=W=quorum for consistency. This is the AP/EL knob made concrete.

Step 5 — Handling failures (the interesting part)

Hinted handoff — if a replica node is down during a write, another node temporarily holds a “hint” and delivers it when the node returns. Keeps writes always available during transient failures.
Read repair — on a read, if replicas disagree, return the newest and asynchronously fix the stale ones.
Anti-entropy (Merkle trees) — replicas periodically compare Merkle trees to find and repair divergence efficiently (Chapter 2’s checksums lesson).

Step 6 — Conflict resolution

With leaderless writes, concurrent updates to one key conflict. Resolve with:

Version vectors / vector clocks — track causality; detect concurrent versions and either merge or return both (“siblings”) to the app.
Last-write-wins — simplest (by timestamp), but can silently drop a write.

Storage engine

Each node stores data in an LSM-tree (memtable + SSTables on disk) — optimized for high write throughput via sequential appends, with a commit log for durability and Bloom filters to skip SSTables on reads that would miss.

Trade-offs to raise

Consistency vs availability/latency — the whole R/W/N knob; default to AP + eventual, tighten where needed.
LWW simplicity vs lost writes — version vectors are correct but complex.
Write availability via hinted handoff vs the complexity of reconciliation.

The interview cue

“Consistent hashing to partition, N replicas on the preference list across racks, tunable R/W/N quorums for the consistency/latency trade, hinted handoff + read repair + Merkle anti-entropy for availability and convergence, version vectors for conflicts, and an LSM-tree engine for write throughput.” That’s the Dynamo paper in one breath; implementation of the write/read paths is next.