Redundancy and replication

Keep copies so a single failure never loses data or takes you down — and choose how much consistency you trade for that safety.

Redundancy vs replication

Redundancy is having spare copies of a component (a standby server, a second power supply) so a failure doesn’t cause an outage. It’s about availability.
Replication is keeping copies of data in sync across nodes. It’s about durability and read scaling, and it’s how you make redundancy real for stateful systems.

Together they remove single points of failure: no one machine’s death loses data or stops the service. Replication in particular buys you four things:

Durability — if one node’s disk dies, the data still exists elsewhere.
Availability — if the primary dies, a replica takes over (failover).
Read throughput — serve reads from many replicas to scale a read-heavy workload.
Locality — keep a replica near users in another region for lower latency.
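The durability and read-throughput points above can be sketched with a toy in-memory store. This is a minimal illustration, not a real replication protocol: the class `ReplicatedStore` and its methods `replicate` and `lose_primary_disk` are hypothetical names invented for this example, and "replication" here is just a dictionary copy.

```python
import itertools
from typing import Optional


class ReplicatedStore:
    """Toy single-primary store: writes go to the primary, reads rotate over replicas."""

    def __init__(self, replica_count: int) -> None:
        self.primary = {}
        self.replicas = [{} for _ in range(replica_count)]
        self._next_replica = itertools.cycle(range(replica_count))

    def write(self, key: str, value: str) -> None:
        # Only the primary accepts writes.
        self.primary[key] = value

    def replicate(self) -> None:
        # Stand-in for the replication stream: copy the primary's state to every replica.
        for replica in self.replicas:
            replica.update(self.primary)

    def read(self, key: str) -> Optional[str]:
        # Spread reads across replicas round-robin to scale read throughput.
        index = next(self._next_replica)
        return self.replicas[index].get(key)

    def lose_primary_disk(self) -> None:
        # Simulate the primary's disk dying.
        self.primary = {}


store = ReplicatedStore(replica_count=3)
store.write("user:1", "alice")
stale = store.read("user:1")      # None: the write has not reached any replica yet
store.replicate()
store.lose_primary_disk()
survived = store.read("user:1")   # "alice": the data still exists on the replicas
```

Note that the first read returns nothing because it hits a replica before replication has run, which is the stale-read cost of scaling reads this way.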

The core trade-off, and the one to articulate, is when the primary acknowledges a write:

Synchronous — the write isn’t acknowledged until replicas have it. No data loss on primary failure, but every write pays the slowest replica’s latency, and a slow/unreachable replica can stall writes.
Asynchronous — the primary acks immediately and ships changes to replicas in the background. Fast and available, but a primary crash can lose the not-yet-replicated tail of writes (and replicas serve slightly stale reads).
Semi-synchronous — a middle ground: wait for at least one replica to confirm, let the rest catch up asynchronously. Bounds data loss without paying for every replica.
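One way to make the three modes concrete is to treat them as a single knob: how many replica confirmations the primary waits for before acknowledging. The sketch below assumes that model; the function name `ack_latency_ms` and the latency figures are hypothetical, and the primary's own local commit time is ignored.

```python
def ack_latency_ms(replica_latencies_ms, required_acks):
    """Time until the primary can acknowledge a write.

    required_acks = 0 models asynchronous replication,
    required_acks = len(replica_latencies_ms) models fully synchronous,
    anything in between models semi-synchronous.
    """
    if not 0 <= required_acks <= len(replica_latencies_ms):
        raise ValueError("required_acks must be between 0 and the replica count")
    if required_acks == 0:
        return 0.0
    # The primary waits for the k-th fastest confirmation.
    return sorted(replica_latencies_ms)[required_acks - 1]


# Hypothetical round-trip times: two nearby replicas and one in another region.
latencies = [2.0, 5.0, 120.0]

asynchronous = ack_latency_ms(latencies, 0)   # 0.0: ack immediately, tail of writes at risk
semi_sync = ack_latency_ms(latencies, 1)      # 2.0: one confirmed copy, cheap
synchronous = ack_latency_ms(latencies, 3)    # 120.0: pays the slowest replica

# An unreachable replica never confirms, so a fully synchronous write stalls.
stalled = ack_latency_ms([2.0, float("inf")], 2)   # inf
```

The same unreachable replica costs nothing under `required_acks=1`, which is why semi-synchronous replication is a common compromise.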

The synchronous/asynchronous choice maps straight onto the consistency vs latency/availability tension you’ll see throughout Chapter 3.

The other axis is topology, meaning which nodes accept writes:

Single-leader (primary–replica) — one node takes writes and streams them to read replicas. Simple, no write conflicts; the leader is a write bottleneck and a failover concern.
Multi-leader — several nodes accept writes (e.g. one per region). Better write availability and locality, but concurrent writes can conflict and need resolution.
Leaderless — any replica accepts reads and writes; consistency is achieved with quorums (its own lesson). Highly available, more complex reasoning.

(Primary–replica vs peer-to-peer is examined as a trade-off in Chapter 3.)
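To see what "conflict and need resolution" means for multi-leader setups, here is a sketch of last-write-wins (LWW), one common resolution strategy. LWW is chosen here purely for illustration and is not prescribed by the text above; `Version`, `resolve_lww`, and the field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    timestamp: int   # hypothetical timestamp attached by the leader that took the write
    leader_id: str   # tie-breaker so every leader picks the same winner


def resolve_lww(a: Version, b: Version) -> Version:
    """Last-write-wins: keep the version with the higher (timestamp, leader_id)."""
    return max(a, b, key=lambda v: (v.timestamp, v.leader_id))


# Two regions accept a write to the same key at the same timestamp.
eu_write = Version(value="blue", timestamp=10, leader_id="eu")
us_write = Version(value="green", timestamp=10, leader_id="us")

# Both leaders must converge on the same answer regardless of argument order.
winner = resolve_lww(eu_write, us_write)
same_winner = resolve_lww(us_write, eu_write)
```

The rule is deterministic, so replicas converge, but the losing write ("blue") is silently discarded. That lost update is the price of the simplicity, and it is the kind of consequence worth stating when proposing multi-leader replication.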

When the leader dies, a replica is promoted. The tricky parts to mention:

Detecting the failure (heartbeats — coming up) without false alarms.
Data loss of un-replicated writes under async replication.
Split-brain — if the old leader comes back and two nodes both think they’re leader, you get divergent writes. Avoided with quorums/fencing so only one leader can win.

Naming split-brain unprompted is a reliable senior signal.
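Fencing can be sketched with monotonically increasing tokens: each newly promoted leader gets a higher token, and storage refuses writes that carry an older one. This is a minimal illustration under that assumption; `FencedStorage` and the token values are hypothetical, and a real system would issue tokens from a consensus or lock service.

```python
class FencedStorage:
    """Storage that rejects writes carrying a token older than the newest it has seen."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.data = {}

    def write(self, token: int, key: str, value: str) -> bool:
        if token < self.highest_token:
            # Stale leader: a newer leader has already written, so refuse.
            return False
        self.highest_token = token
        self.data[key] = value
        return True


storage = FencedStorage()
old_leader_token = 1   # issued when the first leader was elected
new_leader_token = 2   # issued on failover; tokens only ever increase

first = storage.write(old_leader_token, "x", "from old leader")    # True
second = storage.write(new_leader_token, "x", "from new leader")   # True
late = storage.write(old_leader_token, "x", "late write")          # False
```

The old leader may still believe it is in charge, but its late write is rejected at the storage layer, so the two leaders cannot produce divergent data.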