What makes a distributed system "good"
Scalability, reliability, availability, efficiency, and manageability — the five qualities every design is graded on, and how they differ.
The five qualities
When you justify a design, you’re usually arguing about one of these. Knowing the precise difference between them (especially reliability vs availability) is a senior signal.
Scalability
The ability to handle growth — more users, data, or traffic — by adding resources. A design scales well if doubling the load needs roughly double the resources, not 10×.
- Horizontal scaling (more nodes) is the goal; it’s only possible if the system avoids shared bottlenecks (a single master, a global lock).
- Watch for the hidden serial part: if every request must touch one component, that caps your scale no matter how many nodes you add.
Reliability
The probability the system performs correctly over a period — it doesn’t lose data or produce wrong answers, even when components fail. Reliability is bought with redundancy: no single failure should cause incorrect behavior or data loss. A reliable system has no single point of failure.
Availability
The fraction of time the system is up and serving requests, usually in “nines”:
| Availability | Downtime/year |
|---|---|
| 99% | ~3.65 days |
| 99.9% | ~8.8 hours |
| 99.99% | ~52 minutes |
| 99.999% | ~5 minutes |
Reliability vs availability — the distinction interviewers love:
- A system can be available but unreliable (it responds, but with wrong or lost data).
- A reliable system is generally available, but you can sacrifice availability deliberately (refuse writes during a partition) to protect correctness.
Each extra nine costs real money (more redundancy, more ops). Pick the target the requirements justify — a bank ledger and a meme cache are not the same.
Efficiency
How well the system uses its resources — typically expressed as latency (time per operation) and throughput (operations per second). Covered as a trade-off in Chapter 3; the short version is that you optimize for whichever the requirements care about, and they often pull against each other.
Manageability (operability)
How easy the system is to run, observe, and repair: deploys, monitoring, debugging, handling failures. Often undervalued in interviews, but mentioning it (“I’d add metrics and health checks so we can detect a degraded shard”) is a maturity signal. A clever design no one can operate is a bad design.
How to use this in an interview
Open by naming which of these this problem stresses: “This is availability- and scalability-critical but can tolerate eventual consistency.” That single sentence frames every later decision — and tells the interviewer you know what you’re optimizing for.