Skip to content
System design course
Ch.2 · The building blocks·concept ·8 min read

What makes a distributed system "good"

Scalability, reliability, availability, efficiency, and manageability — the five qualities every design is graded on, and how they differ.


The five qualities

When you justify a design, you’re usually arguing about one of these. Knowing the precise difference between them (especially reliability vs availability) is a senior signal.

Scalability

The ability to handle growth — more users, data, or traffic — by adding resources. A design scales well if doubling the load needs roughly double the resources, not 10×.

  • Horizontal scaling (more nodes) is the goal; it’s only possible if the system avoids shared bottlenecks (a single master, a global lock).
  • Watch for the hidden serial part: if every request must touch one component, that caps your scale no matter how many nodes you add.

Reliability

The probability the system performs correctly over a period — it doesn’t lose data or produce wrong answers, even when components fail. Reliability is bought with redundancy: no single failure should cause incorrect behavior or data loss. A reliable system has no single point of failure.

Availability

The fraction of time the system is up and serving requests, usually in “nines”:

AvailabilityDowntime/year
99%~3.65 days
99.9%~8.8 hours
99.99%~52 minutes
99.999%~5 minutes

Reliability vs availability — the distinction interviewers love:

  • A system can be available but unreliable (it responds, but with wrong or lost data).
  • A reliable system is generally available, but you can sacrifice availability deliberately (refuse writes during a partition) to protect correctness.

Each extra nine costs real money (more redundancy, more ops). Pick the target the requirements justify — a bank ledger and a meme cache are not the same.

Efficiency

How well the system uses its resources — typically expressed as latency (time per operation) and throughput (operations per second). Covered as a trade-off in Chapter 3; the short version is that you optimize for whichever the requirements care about, and they often pull against each other.

Manageability (operability)

How easy the system is to run, observe, and repair: deploys, monitoring, debugging, handling failures. Often undervalued in interviews, but mentioning it (“I’d add metrics and health checks so we can detect a degraded shard”) is a maturity signal. A clever design no one can operate is a bad design.

How to use this in an interview

Open by naming which of these this problem stresses: “This is availability- and scalability-critical but can tolerate eventual consistency.” That single sentence frames every later decision — and tells the interviewer you know what you’re optimizing for.