How distributed systems fit together

Why we split one machine into many, the vocabulary you'll reuse all course, and the new failure modes that come free with the bargain.

Why distribute at all

A single server is simple and, for small systems, the right choice. You distribute only when one machine can’t keep up on at least one of three axes:

Compute — more requests/sec than a box can serve.
Storage — more data than a box can hold.
Availability — one box means one failure takes everything down.

Distributing solves these by adding machines — and in return hands you coordination, consistency, and partial-failure problems that didn’t exist before. Most of this chapter is the machinery for managing that trade.

Scale up vs scale out

Vertical scaling (scale up) means a bigger machine (more CPU, RAM, disk). It is simple and usually needs no code changes, but hardware has a ceiling and the one machine remains a single point of failure.
Horizontal scaling (scale out) means more machines working together. Capacity is nearly unbounded and the system can survive losing a node, but only if you add load balancing, partitioning, and replication to make many boxes act like one.

Most real systems scale up first (it’s cheap and easy) and scale out when they must. Interviews are almost always about scaling out.

The vocabulary you’ll reuse everywhere

Node / instance — one running server.
Cluster — a group of nodes cooperating as one logical service.
Replica — a copy of data (or a service) kept for availability or read throughput.
Partition / shard — a slice of the data living on a subset of nodes.
Stateless service — keeps no per-client state between requests, so any node can handle any request (easy to scale). See the stateful-vs-stateless trade-off in Chapter 3.
Stateful service — holds data that must live somewhere specific (a database, a cache node).
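To make "partition / shard" concrete, here is a minimal sketch of hash-based sharding. The names shard_for and group_by_shard, the key format, and the choice of four shards are assumptions made for illustration; this is one simple placement scheme among several, not a recommended design.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash."""
    # hashlib gives the same answer on every node and every run, unlike
    # Python's built-in hash(), which is randomized per process for strings.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def group_by_shard(keys, num_shards):
    """Return {shard index: [keys placed on that shard]}."""
    shards = {i: [] for i in range(num_shards)}
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards


# Hypothetical keys, spread over 4 shards.
keys = [f"user:{i}" for i in range(20)]
placement = group_by_shard(keys, num_shards=4)
for shard, members in placement.items():
    print(shard, members)
```

One caveat with this scheme: because the shard index is taken modulo num_shards, changing the number of shards remaps most keys, so resizing the cluster means moving a lot of data.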

The recurring trick: push state down into a few well-managed stateful systems (databases, caches) and keep the layers above them stateless so you can add and remove them freely.
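The trick can be sketched in a few lines. In this toy example, SharedStore stands in for the stateful layer (a database or cache) and StatelessHandler for a node in the layer above it; both classes, and the visit-counter workload, are hypothetical and exist only to show the shape of the idea.

```python
class SharedStore:
    """Stand-in for the stateful layer (a database or cache)."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=0):
        return self._data.get(key, default)

    def put(self, key, value):
        self._data[key] = value


class StatelessHandler:
    """A node that keeps no per-client state of its own."""

    def __init__(self, name, store):
        self.name = name
        self.store = store  # all state lives in the shared store

    def handle_visit(self, user_id):
        # Read-modify-write against the store. A real system would need
        # this step to be atomic; that detail is omitted in this sketch.
        count = self.store.get(user_id) + 1
        self.store.put(user_id, count)
        return count


store = SharedStore()
nodes = [StatelessHandler(f"node-{i}", store) for i in range(3)]

# Round-robin five requests from the same user across the three nodes.
results = [nodes[i % len(nodes)].handle_visit("alice") for i in range(5)]
print(results)  # [1, 2, 3, 4, 5]

# Adding a node requires no data migration: it just points at the store.
nodes.append(StatelessHandler("node-3", store))
print(nodes[-1].handle_visit("alice"))  # 6
```

Because every handler reads and writes the same store, the count is correct no matter which node serves a given request, and nodes can be added or removed without moving any data.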

The new problem: partial failure

On one machine, components mostly work or crash together. In a distributed system, some parts fail while others keep running: a node dies, a network link drops, a disk corrupts one replica. Worse, from the outside you often can’t tell a slow node from a dead one, since both simply fail to answer in time. Nearly every building block ahead exists to detect, tolerate, or route around partial failure:

Replication so a dead node doesn’t take its data with it,
Heartbeats to notice a node is gone,
Quorums to agree despite some nodes being unreachable,
Checksums to catch silent corruption.
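As an illustration of the heartbeat idea, here is a minimal timeout-based failure detector. The class name HeartbeatMonitor, the node names, and the 3-second timeout are assumptions; timestamps are passed in explicitly to keep the example deterministic, where a real monitor would more likely read a clock such as time.monotonic().

```python
class HeartbeatMonitor:
    """Suspects a node if no heartbeat arrived within timeout_s seconds."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_seen = {}  # node name -> time of last heartbeat

    def record_heartbeat(self, node: str, now: float) -> None:
        self.last_seen[node] = now

    def suspected(self, now: float) -> list:
        # A node is only *suspected*: silence can mean it is dead,
        # slow, or cut off by the network. The monitor cannot tell which.
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


monitor = HeartbeatMonitor(timeout_s=3.0)
monitor.record_heartbeat("a", now=0.0)
monitor.record_heartbeat("b", now=0.0)
monitor.record_heartbeat("a", now=2.0)  # "a" keeps reporting, "b" goes quiet

print(monitor.suspected(now=4.0))  # ['b']
```

The timeout is a trade-off: a short one notices failures quickly but flags slow nodes as dead, while a long one does the opposite.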

Keep that framing — most building blocks are answers to “what happens when part of this breaks?”
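Two of those answers, quorums and checksums, can be combined in one small sketch. This is a deliberately simplified majority read, assuming three replicas that each store a payload alongside its CRC32; the function names and the sample payloads are hypothetical, and real quorum protocols also track versions or timestamps, which are left out here.

```python
import zlib
from collections import Counter


def checksum(payload: bytes) -> int:
    """CRC32 of the payload, stored next to the data on each replica."""
    return zlib.crc32(payload)


def quorum_read(replies, num_replicas):
    """Return the value a majority of replicas agree on, or None.

    replies holds (payload, stored_checksum) pairs from the replicas
    that answered; unreachable replicas simply contribute nothing.
    """
    needed = num_replicas // 2 + 1
    # Drop replies whose payload no longer matches its checksum.
    valid = [payload for payload, stored in replies if checksum(payload) == stored]
    if not valid:
        return None
    value, votes = Counter(valid).most_common(1)[0]
    return value if votes >= needed else None


good = b"balance=100"
corrupt = b"balance=1O0"  # one byte silently flipped on disk

# Three replicas answer, one with corrupted data: 2 valid votes out of 3.
replies = [(good, checksum(good)), (good, checksum(good)), (corrupt, checksum(good))]
print(quorum_read(replies, num_replicas=3))  # b'balance=100'

# Only one replica is reachable: no majority, so the read is refused.
print(quorum_read([(good, checksum(good))], num_replicas=3))  # None
```

The second call shows the cost of the approach: when too few replicas respond, the read fails rather than returning a value that might be stale or wrong.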