System design course
Ch. 3 · Trade-offs that define a design · concept · 6 min read

Latency vs throughput

Speed of one request versus volume of all requests — two different goals that need two different optimizations, and sometimes pull apart.


Two different questions

  • Latency — how long one request takes. “A redirect returns in 20 ms.” Measured per-operation, and best stated as percentiles (p50, p95, p99), not averages — tail latency is what users feel.
  • Throughput — how many requests the system handles per unit time. “We serve 100k requests/sec.” Measured in aggregate (QPS, records/sec, bytes/sec).
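To see why percentiles say more than an average, here is a minimal sketch using the nearest-rank method. The latency samples are made up for illustration, and `percentile` is a hypothetical helper, not a library function.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical latencies in ms: 98 fast requests and 2 slow ones.
latencies_ms = [20] * 98 + [900] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)  # 37.6, describes no real request
p50 = percentile(latencies_ms, 50)               # 20
p95 = percentile(latencies_ms, 95)               # 20
p99 = percentile(latencies_ms, 99)               # 900, the tail the mean hides
```

In this sample the mean (37.6 ms) matches neither the typical request (20 ms) nor the slow one (900 ms), while p99 surfaces the slow tail directly.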

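A system can have great throughput and terrible latency (a batch job churns through millions of rows, but any single row's result is not ready for seconds), or low latency and limited throughput (a single fast server that answers each request quickly but can't handle many at once). They're separate dials: improving one does not automatically improve the other.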
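The two cases can be put in rough numbers with Little's Law, which in steady state relates throughput to the number of items in flight and the time each one spends in the system. The figures below are assumptions chosen for illustration, and `steady_state_throughput` is a hypothetical helper.

```python
def steady_state_throughput(in_flight, latency_s):
    """Little's Law rearranged: throughput = items in flight / time each item spends in the system."""
    if latency_s <= 0:
        raise ValueError("latency must be positive")
    return in_flight / latency_s

# Hypothetical batch job: 10,000 rows in flight, each taking 5 s end to end.
batch_rows_per_s = steady_state_throughput(10_000, 5.0)   # 2,000 rows/s despite 5 s latency

# Hypothetical single server: one request at a time, 2 ms each.
server_req_per_s = steady_state_throughput(1, 0.002)      # about 500 req/s despite 2 ms latency
```

Under these assumptions, the slow batch job moves four times as much work per second as the fast server, because it keeps far more work in flight.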

Why they can pull apart

Optimizing one sometimes hurts the other:

  • Batching raises throughput (amortize fixed costs over many items) but adds latency (each item waits for the batch to fill).
  • Queues smooth load and protect throughput under bursts, but a request now waits in line — more latency.
  • Pipelining / parallelism can help both, until workers start contending for a shared resource (a lock, CPU, disk, or network).
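The batching trade-off can be sketched with a simplified model: one worker, steady arrivals, and a batch that is processed only once it is full. The function name and all cost figures are assumptions for illustration, not measurements.

```python
def batch_tradeoff(batch_size, arrival_rate_per_s, fixed_cost_s, per_item_cost_s):
    """Return (capacity in items/s, worst-case latency in s) for one worker.

    Simplified model: items arrive at a steady rate and a batch is
    processed only once it is full.
    """
    processing_s = fixed_cost_s + batch_size * per_item_cost_s
    capacity = batch_size / processing_s
    # The first item of a batch waits for the remaining batch_size - 1 arrivals.
    fill_wait_s = (batch_size - 1) / arrival_rate_per_s
    return capacity, fill_wait_s + processing_s

# Assumed costs: 10 ms fixed overhead per batch, 0.1 ms per item, 1,000 arrivals/s.
for size in (1, 10, 100):
    capacity, latency_s = batch_tradeoff(size, 1000, 0.010, 0.0001)
    print(f"batch={size:>3}  capacity={capacity:7.0f} items/s  "
          f"worst-case latency={latency_s * 1000:6.1f} ms")
```

With these numbers, going from a batch size of 1 to 100 raises capacity from roughly 99 to roughly 5,000 items/s, while worst-case latency grows from about 10 ms to about 119 ms.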

Optimizing for latency

  • Caching and CDNs — serve from memory / near the user.
  • Precomputation — do the work before the request (e.g. fan-out-on-write timelines), so reads are cheap.
  • Fewer hops — co-locate services, read from a same-region replica, avoid chatty cross-service calls.
  • Right data structures / indexes — turn scans into lookups.
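As a small illustration of turning a scan into a lookup, the sketch below resolves a short code either by walking every row or through a hash index built ahead of the request path. The table contents and function names are hypothetical.

```python
# Hypothetical short-link table: 100,000 (code, url) rows.
rows = [(f"c{i}", f"https://example.com/page/{i}") for i in range(100_000)]

def resolve_by_scan(code):
    """O(n): check rows one by one until the code matches."""
    for row_code, url in rows:
        if row_code == code:
            return url
    return None

# Build the index once, before any request arrives.
index = dict(rows)

def resolve_by_index(code):
    """O(1) on average: a single hash lookup."""
    return index.get(code)
```

Both functions return the same answer. The scan's cost grows with the table, while the indexed lookup stays roughly constant, at the price of extra memory and of keeping the index in sync with the rows.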

Optimizing for throughput

  • Horizontal scaling — more nodes behind a load balancer.
  • Batching and async processing — amortize overhead; decouple with queues.
  • Connection pooling and back-pressure — keep resources busy without overwhelming them.
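Back-pressure can be as simple as a bounded queue that refuses new work when it is full, so producers are forced to slow down, retry later, or shed load. This is a minimal single-threaded sketch using Python's standard `queue` module. The `try_submit` helper and the queue size of 3 are assumptions for illustration.

```python
import queue

def try_submit(q, item):
    """Enqueue without blocking; return False when the queue is full (back-pressure)."""
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        return False

# A deliberately small bound so the effect is visible.
work = queue.Queue(maxsize=3)

# Five submissions with no consumer running: the last two are rejected.
accepted = [try_submit(work, i) for i in range(5)]
# accepted == [True, True, True, False, False]
```

A real service would pair this with worker threads draining the queue. The point of the bound is that queue length, and therefore waiting time, cannot grow without limit.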

The interview cue

State which one this system cares about, because it changes the whole design:

“A trading API is latency-critical — p99 must stay under 50 ms, so I precompute and cache and avoid extra hops. An analytics pipeline is throughput-critical — I’ll batch and process asynchronously and happily trade per-event latency for events-per-second.”

Recognizing that interactive paths optimize latency while bulk/background paths optimize throughput, and that the same system often has both, is the signal interviewers look for. (This is the batch-vs-stream decision in concrete form; see the next lessons.)